PowerPC Exittimings

Revision as of 12:28, 15 July 2010 by HollisBlanchard (talk | contribs) (kvmppc_timing.py was never imported from the old wiki; it's gone now)

About exit timing

PowerPC KVM exports information to debugfs about the exits caused by virtualization and the time consumed by them. This data can typically be found as /sys/kernel/debug/kvm/vm*_vcpu*_timing.
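
As a minimal sketch of how these files could be collected, the following Python snippet globs for the path pattern named above and reads every match. It assumes debugfs is mounted at /sys/kernel/debug and that the caller may read it (usually root only); the function name read_timing_files is made up for this illustration.

```python
import glob

# Path pattern taken from the text above. Assumption: debugfs is mounted
# at /sys/kernel/debug and is readable by the caller (usually root).
TIMING_GLOB = "/sys/kernel/debug/kvm/vm*_vcpu*_timing"


def read_timing_files(pattern=TIMING_GLOB):
    """Return {path: file contents} for every per-vcpu timing file found."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            results[path] = handle.read()
    return results


if __name__ == "__main__":
    # Prints nothing when no guest is running or debugfs is not mounted.
    for path, text in read_timing_files().items():
        print(path)
        print(text)
```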

Because the PowerPC hardware currently supported by KVM has no hardware virtualization support, the guest must run in non-privileged mode. When the guest executes a privileged instruction, the KVM host must emulate its behavior, and this emulation time is accounted for as EMULINST. (Upcoming hardware extensions can be expected to eliminate most of these emulation exits, because the guest can then run in privileged mode.)

Another frequent exit reason is memory/TLB management. A guest cannot be allowed to program the real TLB: the instructions involved are privileged, and it would be an isolation violation anyway. The host therefore has to track the state of the guest TLB and resolve the TLB faults that occur only because the guest is virtualized. TLB interrupts caused by virtualization in this way are called [DI]TLBVIRT in the exit statistics. If the guest TLB tracked by the host has no mapping for the address reported by a TLB exception, the exception is delivered to the guest, just as it would be on bare metal. This appears as [DI]TLBREAL in the exit statistics.

When the guest idles, it enters "wait mode" until an interrupt is delivered. This time is accounted for under the exit type WAIT (listed as EMUL_WAIT in the tables below).

The other exits, such as MMIO, DCR and SIGNAL, are less frequent; they have to exit to KVM userspace to be handled. The only value that is not really an exit is TIMEINGUEST, which is the time spent in the guest.

The timing statistics track the exit time, the reenter time and the type of each exit. The duration exit -> enter is accounted to the specific exit type, while the duration enter -> exit is accounted to TIMEINGUEST.
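
The accounting rule above can be illustrated with a small Python toy model. This is only a sketch of the bookkeeping as described in the text, not the kernel code: the class name, the method names and the timestamps in the usage lines are hypothetical.

```python
from collections import defaultdict


class ExitTimingSketch:
    """Toy model of the exit/enter accounting (not the kernel code)."""

    def __init__(self):
        self.count = defaultdict(int)   # number of intervals per type
        self.total = defaultdict(int)   # accumulated microseconds per type
        self.last_enter = None          # timestamp of the latest guest entry
        self.last_exit = None           # timestamp of the latest guest exit

    def _account(self, kind, duration):
        self.count[kind] += 1
        self.total[kind] += duration

    def on_exit(self, now):
        # enter -> exit: the guest was running, so charge TIMEINGUEST.
        if self.last_enter is not None:
            self._account("TIMEINGUEST", now - self.last_enter)
        self.last_exit = now

    def on_enter(self, now, exit_type=None):
        # exit -> enter: the host handled an exit, so charge its type.
        if self.last_exit is not None and exit_type is not None:
            self._account(exit_type, now - self.last_exit)
        self.last_enter = now


# Hypothetical timestamps in microseconds.
timing = ExitTimingSketch()
timing.on_enter(100)               # first guest entry, nothing to account
timing.on_exit(130)                # 30 usec in the guest
timing.on_enter(137, "EMULINST")   # 7 usec spent emulating an instruction
timing.on_exit(140)                # 3 usec in the guest
timing.on_enter(160, "DTLBVIRT")   # 20 usec spent fixing a TLB miss
print(dict(timing.total))
# {'TIMEINGUEST': 33, 'EMULINST': 7, 'DTLBVIRT': 20}
```

In this sketch the exit type is passed in at reentry purely for simplicity; where the real implementation records the type is not stated on this page.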

Note: All times are in microseconds (usec).

workloads and performance expectations

Loads that cause a high number of exits can be expected to have a high overhead, while loads that run in the guest with only a few interceptions should be fine. Examples of loads with high exit ratios are a guest booting and initializing all its virtual hardware (EMULINST), and a load that creates memory pressure and therefore causes a lot of virtual TLB misses ([DI]TLBVIRT).

The following measurements were taken on a 440EPx (Sequoia) board. This means running unmodified guests on hardware without virtualization support, so a lot of overhead can be expected. The following statistics give you an idea of which exit reasons are frequent depending on the workload. As mentioned above, you can consider how this would change on PowerPC hardware with virtualization support, as specified in Power ISA 2.06 (http://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.06).

The following sections describe the workloads shown on this page.

(Figure: Workload classification.gif)

boot guest

The workload traces a guest from initialization until the boot prompt appears. Although simple, it is useful as a load with a high proportion of privileged instructions.

Hashes

This is a small custom-written program that calculates a lot of hashes. It uses the hash algorithm used in Berkeley DB and calculates the hash of a hash, and so on. The program represents a workload with only a little I/O and few privileged instructions, and therefore only a low virtualization overhead. (Source: attachment:foohash.c) To execute it, just call the binary without options:

  ./foohash

Fhourstone

The Fhourstone benchmark (http://homepages.cwi.nl/~tromp/c4/fhour.html) uses a 64Mb transposition table to solve positions of the game Connect-4. In an environment as small as the Sequoia board this creates a lot of memory pressure, and it is therefore expected to cause a lot of TLB exits. After compiling the sources you get a binary called SearchGame and a file called "inputs", which represents the workload. The benchmark is invoked with:

  SearchGame < inputs

Lame

Lame (http://lame.sourceforge.net/) is the well-known MP3 encoder. In this workload it converts a file at very high quality (option --preset insane). The workload has to do some I/O as well as a lot of calculations that should not exit to the hypervisor, so lame can be expected to be a good example of an average workload. The WAV file comes from a free sample page on the web (http://bellatrix.ece.mcgill.ca/Documents/AudioFormats/WAVE/Samples.html); we used M1F1-int32-AFsp.wav from that page:

  lame --preset insane M1F1-int32-AFsp.wav M1F1-int32-AFsp.mp3

find

Finally, using find to search the whole disk for a file is a simple workload that needs only a little CPU time but a lot of I/O. In this test case find is executed on the root directory to search for the WAV file used in the Lame test case, but with a wildcard. The disk is the ELDK ramdisk plus the files for the workloads listed here. To be fair and use the same file system, the bare metal test mounts the same image as a loop device.

  cd /
  time find -name "*.wav*"

Performance results

This section lists the percentage of native bare metal speed at which these tests run on the current kvmppc implementation. As described below, there are already known options for improvement, such as paravirtualization. The tests were run on the source level of 11 November 2008, which included some new performance improvements in memory management and interrupt delivery as well as several minor improvements. Host and guest both use a 64k page size. The host uses an NFS mount as its root file system, while the guest uses virtio to access a disk image placed on the host's NFS root.

workload     % of bare metal speed
hashes       96.49%
lame         84.47%
boot         ~80%
find         6.11%
fhourstone   5.96%

Note: network latency behind the virtio indirection might be a big issue for the find test case, so treat that number as preliminary until it has been verified on, e.g., a local USB stick.

Note: the time accounted for MMIO runs from the guest exit, through KVM preparing the MMIO, until the return to the guest. It is not the time until the I/O arrives and is ready for the guest. Additional I/O performance data may be obtained by running blktrace on the virtio disk inside the guest.
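
To make the percentages in the performance table easier to read, the following Python sketch converts them into slowdown factors. It assumes that "% of bare metal speed" means bare metal runtime divided by guest runtime; the page does not spell out the definition, so treat that as an assumption. The boot workload is left out because its value is only approximate.

```python
# Percentages copied from the performance table above.
PERCENT_OF_BARE_METAL = {
    "hashes": 96.49,
    "lame": 84.47,
    "find": 6.11,
    "fhourstone": 5.96,
}


def slowdown_factor(percent_of_native_speed):
    """How many times longer the guest run takes than the bare metal run.

    Assumption: percent = 100 * bare_metal_runtime / guest_runtime.
    """
    return 100.0 / percent_of_native_speed


for name, percent in PERCENT_OF_BARE_METAL.items():
    print(f"{name:<11} {slowdown_factor(percent):6.2f}x")
```

Under that assumption, find and fhourstone each take roughly 16 to 17 times as long in the guest as on bare metal, while hashes stays within a few percent.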

Timing results

The following tables show the results of the exit timing analysis using the 5 different workloads mentioned above.

boot guest

sum of time 8309940
        type    count      min      max          sum              avg       stddev     %
       MMIO:     9402       44     1997      1697610         180.5584      155.768 20.43
        DCR:      680       41       99        32096          47.2000        7.008  0.39
     SIGNAL:        1       98       98           98          98.0000        0.000  0.00
   ITLBREAL:      926        8       14         7810           8.4341        0.658  0.09
   ITLBVIRT:     3595       18      202        76185          21.1919        4.954  0.92
   DTLBREAL:      950        8       16         8891           9.3589        1.427  0.11
   DTLBVIRT:     6695       18      282       156727          23.4096       13.781  1.89
    SYSCALL:     1801        6       59        11372           6.3143        2.575  0.14
        ISI:      116        6        8          764           6.5862        0.588  0.01
        DSI:       43        6        7          292           6.7907        0.407  0.00
   EMULINST:    65247        7       96       484081           7.4192        1.818  5.83
  EMUL_WAIT:      801      659     9200      3721789        4646.4282     1687.218 44.79
  EMUL_CORE:    66806        7       86       540053           8.0839        1.895  6.50
 EMUL_MTSPR:    13415        8       62       111358           8.3010        2.583  1.34
 EMUL_MFSPR:     7635        8       61        62772           8.2216        1.921  0.76
 EMUL_MTMSR:     5678        8       59        45704           8.0493        1.434  0.55
 EMUL_MFMSR:    32853        7       67       267603           8.1455        1.875  3.22
 EMUL_TLBSX:      354        9       60         3745          10.5791        3.919  0.05
 EMUL_TLBWE:     6403        9      112        99522          15.5430        7.668  1.20
   EMUL_RFI:     9515        7       57        71420           7.5060        2.108  0.86
        DEC:      290       49      161        15786          54.4345        9.708  0.19
     EXTINT:        7       74       75          522          74.5714        0.495  0.01
TIMEINGUEST:   233213        0     3954       893740           3.8323       65.837 10.76
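
The columns of such a table can be cross-checked with a few lines of Python. The sketch below is an illustration only: it assumes the "TYPE: count min max sum avg stddev %" layout shown above (rows with a colon after the type name), and the names parse_timing_table and percent_of_total are made up for this example. The sample rows are copied from the boot table.

```python
SAMPLE = """\
sum of time 8309940
        type    count      min      max          sum              avg       stddev     %
       MMIO:     9402       44     1997      1697610         180.5584      155.768 20.43
  EMUL_WAIT:      801      659     9200      3721789        4646.4282     1687.218 44.79
TIMEINGUEST:   233213        0     3954       893740           3.8323       65.837 10.76
"""


def parse_timing_table(text):
    """Return (sum_of_time, {type: {'count', 'min', 'max', 'sum'}})."""
    total_time = None
    rows = {}
    for line in text.splitlines():
        if line.strip().startswith("sum of time"):
            total_time = int(line.split()[3])
            continue
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) != 7:
            continue  # header line, blank line, or a different layout
        count, low, high, total = (int(field) for field in fields[:4])
        rows[name.strip()] = {
            "count": count, "min": low, "max": high, "sum": total,
        }
    return total_time, rows


def percent_of_total(row, total_time):
    """Share of the overall time spent in one exit type, in percent."""
    return 100.0 * row["sum"] / total_time


total_time, rows = parse_timing_table(SAMPLE)
for name, row in rows.items():
    average = row["sum"] / row["count"]
    share = percent_of_total(row, total_time)
    print(f"{name:>12} avg={average:.4f} share={share:.2f}%")
```

The avg column is simply sum divided by count, and the % column is sum divided by the "sum of time" line; for the MMIO row this reproduces the 180.5584 and 20.43 printed in the table.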

Hashes

sum of time 21576367
        type    count      min      max          sum              avg       stddev     %
       MMIO:      827       49     6700       224632         271.6227      259.231  1.04
        DCR:      161       42       94         7468          46.3851        4.314  0.03
     SIGNAL:        2      291     1214         1505         752.5000      461.500  0.01
   ITLBREAL:       53        8       12          445           8.3962        0.682  0.00
   ITLBVIRT:      216       19       68         4566          21.1389        3.346  0.02
   DTLBREAL:       44        9       16          420           9.5455        1.738  0.00
   DTLBVIRT:      407       19       73         8706          21.3907        3.687  0.04
    SYSCALL:       66        6        7          428           6.4848        0.500  0.00
        ISI:        5        6        8           34           6.8000        0.748  0.00
        DSI:        1        7        7            7           7.0000        0.000  0.00
   EMULINST:    67009        6       97       508311           7.5857        1.247  2.36
  EMUL_WAIT:      231     1254     8902      1074304        4650.6667     1699.150  4.98
  EMUL_CORE:    32964        7       59       262866           7.9743        0.622  1.22
 EMUL_MTSPR:     9201        8       14        74751           8.1242        0.339  0.35
 EMUL_MFSPR:      379        8       60         3134           8.2691        2.686  0.01
 EMUL_MTMSR:     4749        8        9        37996           8.0008        0.029  0.18
 EMUL_MFMSR:    14257        7       55       114282           8.0159        0.776  0.53
 EMUL_TLBSX:       18        9       15          185          10.2778        1.193  0.00
 EMUL_TLBWE:      393        9       69         5477          13.9364        6.905  0.03
   EMUL_RFI:     9006        7       57        67271           7.4696        0.722  0.31
        DEC:     5065       49      269       267567          52.8267       13.544  1.24
     EXTINT:        2       77      451          528         264.0000      187.000  0.00
TIMEINGUEST:   145056        0     3954     18911484         130.3737      678.943 87.65

Lame

sum of time 6592939
        type    count      min      max          sum              avg       stddev     %
       MMIO:     1827       48    18883       550073         301.0799      772.936  8.34
        DCR:      392       42     1074        22162          56.5357       83.884  0.34
     SIGNAL:        1     1812     1812         1812        1812.0000        0.000  0.03
   ITLBREAL:      142        8       13         1200           8.4507        0.623  0.02
   ITLBVIRT:     1860       18      118        39336          21.1484        4.514  0.60
   DTLBREAL:      164        8       66         1707          10.4085        4.885  0.03
   DTLBVIRT:     2724       18     1039        63063          23.1509       23.705  0.96
    SYSCALL:      255        6        8         1626           6.3765        0.531  0.02
        ISI:       17        6        8          114           6.7059        0.570  0.00
        DSI:        1        7        7            7           7.0000        0.000  0.00
   EMULINST:    26682        6      161       203151           7.6138        2.885  3.08
  EMUL_WAIT:      261      501     7683      1211114        4640.2835     1683.305 18.37
  EMUL_CORE:    19247        7      161       158089           8.2137        4.052  2.40
 EMUL_MTSPR:     4309        8      161        35697           8.2843        2.371  0.54
 EMUL_MFSPR:     1238        8       61        10395           8.3966        3.510  0.16
 EMUL_MTMSR:     2098        8       61        17045           8.1244        2.403  0.26
 EMUL_MFMSR:     9158        7      163        75613           8.2565        3.392  1.15
 EMUL_TLBSX:       36        9       13          364          10.1111        0.698  0.01
 EMUL_TLBWE:     1062        9       75        17011          16.0179        7.539  0.26
   EMUL_RFI:     3736        7      142        28230           7.5562        2.974  0.43
        DEC:     1109       49      263        61117          55.1100       15.174  0.93
     EXTINT:       36       52     1377        12085         335.6944      308.306  0.18
TIMEINGUEST:    76355        0     3954      4081928          53.4599      415.345 61.91

Fhourstone

sum of time 7483768
        type    count      min      max          sum              avg       stddev     %
       MMIO:      818       47     9565       221609         270.9156      501.827  2.96
        DCR:      301       40      473        14408          47.8671       29.403  0.19
     SIGNAL:        1     2521     2521         2521        2521.0000        0.000  0.03
   ITLBREAL:      322        8       58         2665           8.2764        2.810  0.04
   ITLBVIRT:     5773       18     1360       120111          20.8056       18.569  1.60
   DTLBREAL:    16433        8       73       184196          11.2089        3.709  2.46
   DTLBVIRT:    19913       18     1845       500349          25.1268       23.006  6.69
    SYSCALL:       91        6        7          579           6.3626        0.481  0.01
        ISI:        5        6        8           33           6.6000        0.800  0.00
        DSI:        1        7        7            7           7.0000        0.000  0.00
   EMULINST:   127113        6      102       949687           7.4712        2.244 12.69
  EMUL_WAIT:       76     3526     7578       354928        4670.1053     1679.367  4.74
  EMUL_CORE:    16733        7      159       134306           8.0264        1.701  1.79
 EMUL_MTSPR:    71886        8      149       594618           8.2717        2.595  7.95
 EMUL_MFSPR:    83877        8       93       689016           8.2146        2.580  9.21
 EMUL_MTMSR:     3166        8       57        25438           8.0347        1.220  0.34
 EMUL_MFMSR:     7392        7       61        59415           8.0377        1.294  0.79
 EMUL_TLBSX:       19        9       11          192          10.1053        0.640  0.00
 EMUL_TLBWE:    47187        9     2757      1131527          23.9796       13.359 15.12
   EMUL_RFI:    21782        7      132       172428           7.9161        2.323  2.30
        DEC:      715       50      224        42155          58.9580       15.052  0.56
     EXTINT:        2      142      161          303         151.5000        9.500  0.00
TIMEINGUEST:   423606        0      499      2283277           5.3901       25.629 30.51

find

sum of time 3426052
        type    count      min      max          sum              avg       stddev     %
       MMIO:      222       49      413        48228         217.2432      141.096  1.41
        DCR:       91       43       93         4239          46.5824        5.285  0.12
     SIGNAL:        3      476     5651         7952        2650.6667     2191.871  0.23
   ITLBREAL:       77        8       13          665           8.6364        0.754  0.02
   ITLBVIRT:     1341       18      120        28340          21.1335        4.968  0.83
   DTLBREAL:       59        8       16          573           9.7119        2.042  0.02
   DTLBVIRT:     2253       19      214        48630          21.5846        7.083  1.42
    SYSCALL:     4590        6       57        29503           6.4277        2.114  0.86
        ISI:       11        6        8           72           6.5455        0.656  0.00
        DSI:        1        7        7            7           7.0000        0.000  0.00
   EMULINST:    71560        6       77       525976           7.3501        1.945 15.35
  EMUL_WAIT:      374      184     9384      1724701        4611.5000     1752.946 50.34
  EMUL_CORE:    32646        7      100       262449           8.0392        1.792  7.66
 EMUL_MTSPR:     6668        8       78        54829           8.2227        2.213  1.60
 EMUL_MFSPR:      538        8       61         4507           8.3773        3.181  0.13
 EMUL_MTMSR:     5829        8       58        47036           8.0693        1.765  1.37
 EMUL_MFMSR:    15805        7       92       127426           8.0624        1.774  3.72
 EMUL_TLBSX:       29        9       14          297          10.2414        0.857  0.01
 EMUL_TLBWE:      462        9       27         6717          14.5390        6.606  0.20
   EMUL_RFI:    10855        7       57        81776           7.5335        2.132  2.39
        DEC:      160       50      403         9244          57.7750       32.861  0.27
     EXTINT:        4      427     1410         2991         747.7500      387.275  0.09
TIMEINGUEST:   153578        0      762       409894           2.6690        6.322 11.96


Paravirtualization improvement

As mentioned above, ways to reduce all of this overhead are already known: on the one hand the hardware virtualization support specified in Power ISA 2.06, on the other hand paravirtualization, which can be an option on older hardware. For KVMPPC we wrote a simple paravirtualization interface to test hypercalls and measure some of the benefits of such implementations. In the concept measured here, the hypervisor tells the guest that it supports special paravirtualization features; the guest then issues a hypercall that passes a guest virtual address, a guest physical address and the size of a region of RAM (4096 bytes in the example). This is a very basic but flexible interface, as the hypervisor can now use this guest-addressable memory for all kinds of things. In the example the hypervisor rewrites guest code to replace the privileged instructions mfspr (SPRG1-3, ESR, DEAR), mtspr (SPRG1-3, ESR, DEAR) and mfmsr. The hypervisor keeps the guest copies of these values up to date in the memory area provided by the guest and rewrites the privileged instructions into simple non-trapping load/store instructions. That saves a lot of EMULINST exits while running the guest and has shown around 35-50% improvement for workloads with a high number of EMULINST exits (e.g. the boot workload).

The net benefit of this improvement is high while the "guest invasiveness" is very low: the guest only has to donate a small amount of RAM and virtual address space, and everything else can be done transparently by the hypervisor. And remember that this is just one simple example of PV extensions. There are many other areas, e.g. collaborative guest/host TLB management, that can improve performance significantly (this could be as easy as telling the guest to program more virtual TLB entries into the guest TLB so that the host can fix more misses directly).

The following numbers show the improvement by comparing the same workload with and without this paravirtualization feature. The workload used in this test is the boot workload, but with a more complex guest and host environment (Ubuntu instead of ELDK). It also uses an older version of our kernel and userspace code, which contains an older version of the exit timing, so the numbers are not directly comparable with the measurements above.

No paravirtualization

sum of time 144837890 => ~2:24 runtime
               count  min    max       sum     avg   stddev      %
       MMIO    10105   46   1534   2952467  292.17  295.933   2.04
        DCR     5428   40    209    246168   45.35    6.758   0.17
     SIGNAL      695   65   3755     89767  129.16  314.421   0.06
   ITLBREAL    80045   13    108   1063127   13.28    2.338   0.73
   ITLBVIRT  1000585   21 264827  24300596   24.28  264.753  16.78
   DTLBREAL    91206   13     69   1285214   14.09    2.225   0.89
   DTLBVIRT   977434   21   1446  24007008   24.56    4.426  16.58
    SYSCALL    10460   11     55    116447   11.13    1.929   0.08
        ISI    11724   11     61    130007   11.08    1.929   0.09
        DSI    20737   11     57    230009   11.09    1.914   0.16
   EMULINST  5683356   11   3778  79339467   13.96   50.275  54.78
        DEC    13079   50    826    732712   56.02   22.382   0.51
     EXTINT       55   30   1478     10996  199.92  238.150   0.01
 FP_UNAVAIL      280   11     53      3163   11.29    3.495   0.00
TIMEINGUEST  7905189    0   3688  10330742    1.30    8.970   7.13

paravirtualization

sum of time 92206510 => ~1:32 runtime (~36% net improvement)
               count  min    max       sum     avg   stddev      %
       MMIO    12505   46   3087   3693782  295.38  260.788   4.01
        DCR     5595   40    706    273578   48.89   31.305   0.30
     SIGNAL      654   65   4132    300027  458.75  571.130   0.33
   ITLBREAL    71711   13    104    943053   13.15    2.360   1.02
   ITLBVIRT   750649   21   1503  18178245   24.21    7.335  19.71
   DTLBREAL    83356   13    102   1146242   13.75    2.406   1.24
     DTLBPV    30086   20    237    653556   21.72    4.639   0.71
   DTLBVIRT   772811   21    713  19079477   24.68    6.593  20.69
    SYSCALL     7647   11     57     84821   11.09    1.897   0.09
      HCALL        1   19     19        19   19.00    0.000   0.00
        ISI     9895   11     73    109667   11.08    1.904   0.12
        DSI    17974   10     57    199504   11.09    2.046   0.22
   EMULINST  2567245   11   4212  40501314   15.77   65.673  43.92
        DEC     7488   51    641    426813   56.99   23.893   0.46
     EXTINT     2215   31   1677    297495  134.30  116.219   0.32
 FP_UNAVAIL      258   11     11      2838   11.00    0.000   0.00
TIMEINGUEST  4340090    0   3850   6316079    1.45   12.599   6.85
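
The effect of the paravirtualization feature can be quantified directly from the two tables above. The following Python sketch computes the relative reduction in overall time and in EMULINST exits; the numbers are copied from the tables, while the dictionary keys and the helper name reduction_percent are made up for this illustration.

```python
# Values copied from the "No paravirtualization" table.
NO_PV = {
    "sum of time": 144837890,
    "EMULINST count": 5683356,
    "EMULINST sum": 79339467,
}

# Values copied from the "paravirtualization" table.
WITH_PV = {
    "sum of time": 92206510,
    "EMULINST count": 2567245,
    "EMULINST sum": 40501314,
}


def reduction_percent(before, after):
    """Relative reduction from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


for key in NO_PV:
    saved = reduction_percent(NO_PV[key], WITH_PV[key])
    print(f"{key:<15} {NO_PV[key]:>10} -> {WITH_PV[key]:>10}  ({saved:.1f}% less)")
```

More than half of the EMULINST exits disappear, which accounts for most of the drop in overall runtime.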

More Results

This page is only meant as an overview and is already huge. More results can be found in the various timing and improvement discussions on kvm-ppc@vger.kernel.org.