PowerPC Exit Timings
About exit timing
PowerPC KVM exports information to debugfs about the exits caused by virtualization and the time consumed by them. This data can typically be found in /sys/kernel/debug/kvm/vm*_vcpu*_timing.
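For a quick look at the raw data you can simply cat these files; the following minimal C sketch does the same for every per-vCPU timing file (it assumes debugfs is mounted at /sys/kernel/debug and that the kernel was built with exit timing support):
{{{
/* Dump all per-vCPU exit timing files exported by KVM via debugfs. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;
    size_t i;
    char line[256];

    if (glob("/sys/kernel/debug/kvm/vm*_vcpu*_timing", 0, NULL, &g) != 0)
        return 1;               /* no VM running or timing not enabled */

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (!f)
            continue;
        printf("== %s ==\n", g.gl_pathv[i]);
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
    }
    globfree(&g);
    return 0;
}
}}}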
Because the PowerPC hardware currently supported by KVM has no hardware virtualization support, the guest must run in non-privileged mode. When the guest executes a privileged instruction, the KVM host must emulate its behavior, and this emulation time is accounted for as EMULINST. (Upcoming hardware extensions can be expected to remove most of these emulation exits, as the guest will then be able to run in privileged mode.)
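Conceptually, each of these exits follows the classic trap-and-emulate pattern. The sketch below is only an illustration with made-up names, not the real kvmppc code; it emulates a single mfmsr instruction:
{{{
#include <stdint.h>
#include <stdio.h>

/* All names here are illustrative, not the real kvmppc symbols. */
struct vcpu { uint32_t pc; uint32_t msr; uint32_t gpr[32]; };

#define INST_MFMSR 0x7c0000a6u          /* mfmsr rD; rD in bits 21-25 */

/* The guest runs unprivileged, so a privileged instruction traps into
 * the host, which emulates its architectural effect and resumes the
 * guest behind it. The whole round trip is accounted as EMULINST. */
static int emulate_privileged(struct vcpu *v, uint32_t inst)
{
    if ((inst & 0xfc1fffffu) == INST_MFMSR) {
        v->gpr[(inst >> 21) & 0x1f] = v->msr;   /* mfmsr rD */
        v->pc += 4;                             /* step past it */
        return 0;
    }
    return -1;                  /* unhandled: reflect to guest or abort */
}

int main(void)
{
    struct vcpu v = { .pc = 0x100, .msr = 0x2000 };
    uint32_t mfmsr_r5 = INST_MFMSR | (5u << 21);

    emulate_privileged(&v, mfmsr_r5);
    printf("r5=0x%x pc=0x%x\n", (unsigned)v.gpr[5], (unsigned)v.pc);
    return 0;
}
}}}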
Another frequent exit reason is memory/TLB management. Because a guest cannot be allowed to program the real TLB (the instructions are privileged, and it would be an isolation violation anyway), the host has to track the state of the guest TLB and recover from TLB faults that occur only because the guest is virtualized. TLB interrupts of this kind, caused purely by virtualization, appear as [DI]TLBVIRT in the exit statistics. If the guest TLB tracked by the host has no mapping for the address reported by a TLB exception, the exception is delivered to the guest just as it would be on bare metal; this appears as [DI]TLBREAL in the exit statistics.
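A sketch of this classification, with illustrative names that do not match the real kvmppc code:
{{{
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Does the guest's own (software-tracked) TLB map this address?
 * The check below is a stand-in for the real lookup. */
static bool guest_tlb_maps(uint64_t eaddr)
{
    return eaddr < 0x10000000;
}

/* A TLB miss always traps to the host, because only the host may
 * program the real TLB. The host then classifies it: */
static const char *classify_tlb_miss(uint64_t eaddr)
{
    if (guest_tlb_maps(eaddr)) {
        /* The guest did set up a mapping; the miss exists only because
         * of virtualization. The host refills the real TLB from its
         * shadow state -> counted as ITLBVIRT/DTLBVIRT. */
        return "TLBVIRT";
    }
    /* No guest mapping either: deliver the TLB exception to the guest,
     * exactly as bare metal hardware would -> ITLBREAL/DTLBREAL. */
    return "TLBREAL";
}

int main(void)
{
    printf("%s\n", classify_tlb_miss(0x00001000));  /* TLBVIRT */
    printf("%s\n", classify_tlb_miss(0xdeadbeef));  /* TLBREAL */
    return 0;
}
}}}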
When the guest idles, it enters "wait mode" until an interrupt is delivered. This time is accounted to the WAIT exit type.
The remaining exit types, such as MMIO, DCR and SIGNAL, are less frequent and need to exit to kvm-userspace to be handled. The only value that is not really an exit is TIMEINGUEST, which is the time spent executing in the guest.
The timing statistics track the exit and reentry time as well as the type of each exit. The duration exit -> enter is then accounted to the specific exit type, while enter -> exit is accounted to TIMEINGUEST.
(!) All times are in microseconds (usec).
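A minimal sketch of this accounting scheme (field and function names are made up; the real code lives in the kvmppc sources):
{{{
#include <stdint.h>
#include <stdio.h>

enum exit_type { EMULINST, DTLBVIRT, MMIO, TIMEINGUEST, NR_EXIT_TYPES };

struct exit_timing { uint64_t count, min, max, sum; };

static struct exit_timing stats[NR_EXIT_TYPES];
static uint64_t last_stamp;     /* timestamp of the previous transition */

/* Called on every guest<->host transition; the elapsed interval is
 * charged to the specific exit type on exit->enter and to TIMEINGUEST
 * on enter->exit. */
static void account(enum exit_type type, uint64_t now)
{
    uint64_t d = now - last_stamp;
    struct exit_timing *t = &stats[type];

    t->count++;
    t->sum += d;
    if (t->count == 1 || d < t->min)
        t->min = d;
    if (d > t->max)
        t->max = d;
    last_stamp = now;
}

int main(void)
{
    account(TIMEINGUEST, 100);  /* guest ran for 100 usec ... */
    account(MMIO, 350);         /* ... then one MMIO exit cost 250 usec */
    printf("MMIO sum: %llu usec\n", (unsigned long long)stats[MMIO].sum);
    return 0;
}
}}}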
Workloads and performance expectations
Workloads that cause a high number of exits can be expected to have high overhead, while workloads that run in the guest with only a few interceptions should be fine. Examples of workloads with high exit ratios are a guest booting and initializing all its virtual hardware (EMULINST), and workloads that create memory pressure and therefore cause a lot of virtual TLB misses ([DI]TLBVIRT).
The following measurements were taken on a 440EPx (Sequoia) board. This means running an unmodified guest on hardware without virtualization support, so a lot of overhead can be expected. The following statistics give you an idea which exit reasons are frequent depending on the workload, and as mentioned above you can imagine what happens once you run the same loads on PowerPC hardware with the virtualization support coming with Power ISA 2.06 (http://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.06).
The following sections describe the workloads shown on this page.
boot guest
This workload traces a guest from initialization until the boot prompt appears. Simple as it is, it is useful as a load with a high amount of privileged instructions.
Hashes
This is a small custom-written program that calculates a lot of hashes. It uses the hash algorithm of the Berkeley DB and repeatedly calculates the hash of the previous hash, and so on.
It represents a workload with only little I/O and few privileged instructions, and therefore only low virtualization overhead.
attachment:foohash.c
To execute it just call the binary without options:
./foohash
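The attachment is the authoritative source; as an illustration only, a stand-in for such a hash-of-hash workload could look like this (the hash below is a generic multiplicative string hash, not necessarily the exact Berkeley DB function):
{{{
#include <stdio.h>
#include <stdint.h>

static uint32_t hash_buf(const unsigned char *buf, size_t len)
{
    uint32_t h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = (h << 5) + h + buf[i];      /* simple multiplicative hash */
    return h;
}

int main(void)
{
    uint32_t h = 0xdeadbeef;
    long i;

    /* hash of hash of hash ...: pure integer work with almost no I/O
     * and no privileged instructions, so very few exits are expected */
    for (i = 0; i < 100000000L; i++)
        h = hash_buf((unsigned char *)&h, sizeof(h));

    printf("%08x\n", (unsigned)h);
    return 0;
}
}}}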
Fhourstone
The Fhourstone benchmark (http://homepages.cwi.nl/~tromp/c4/fhour.html) uses a 64 MB transposition table to solve positions of the game Connect Four. In a small environment like the Sequoia board this means a lot of memory pressure, so it is expected to cause many TLB exits. After compiling the sources you get a binary called SearchGame and a file called "inputs" which represents the workload. The benchmark is invoked with:
SearchGame < inputs
Lame
Lame (http://lame.sourceforge.net/) is the well-known MP3 encoder; the workload test case uses it to convert a file at very high quality (option --preset insane). This workload does some I/O as well as a lot of calculations that should not exit to the hypervisor, so lame can be expected to be a good example of an average workload.
The wav file used comes from a free sample page on the web (http://bellatrix.ece.mcgill.ca/Documents/AudioFormats/WAVE/Samples.html). We used M1F1-int32-AFsp.wav from that page with the "insane" preset to get a maximum-quality MP3.
lame --preset insane M1F1-int32-AFsp.wav M1F1-int32-AFsp.mp3
find
Finally, using find to search the whole disk for a file is a simple workload that needs only few CPU calculations but a lot of I/O.
As a test case, find is executed from the root directory to search for the wav file used in the Lame test case, but with a wildcard. The disk is the ELDK ramdisk plus the files for the workloads listed here. To be fair and use the same file system, the bare metal test mounts the same image as a loop device.
cd /
time find -name "*.wav*"
Performance results
This section lists the percentage of native bare metal speed at which these tests run on the current kvmppc implementation. As described below, options for improvement, such as paravirtualization, are already known. The tests were run on the source level of 11 November 2008, which included some new performance improvements in memory management and interrupt delivery as well as several minor improvements. Host and guest both run with a 64k page size. The host uses an NFS mount as its root file system, while the guest uses virtio to access a disk image placed on the host's NFS root.
workload   | % of bare metal speed |
hashes     | 96.49%                |
lame       | 84.47%                |
boot       | ~80%                  |
find       | 6.11%                 |
fhourstone | 5.96%                 |
(!) Network latency behind the virtio indirection might be a big issue for the find test case, so treat these numbers as preliminary until they have been verified on e.g. a local USB stick. /!\ As a side note, it is worth mentioning that the time accounted to MMIO is the time from the guest exit, through KVM preparing the MMIO, until the return to the guest. It is not the time until the I/O arrives and is ready for the guest. Additional I/O performance data may be obtained by running blktrace on the virtio disk inside the guest.
Timing results
The following tables show the results of the exit timing analysis using the five different workloads mentioned above. You can produce similar postprocessed reports by running this script (attachment:kvmppc_timing.py) on the data reported by the debugfs interface.
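For reference, the derived columns are plain arithmetic over the raw counters. The example below recomputes the avg and % columns from the MMIO row of the boot guest table that follows (assuming the interface exports count/min/max/sum per exit type; stddev additionally needs a sum of squared durations):
{{{
#include <stdio.h>

int main(void)
{
    double count = 9402, sum = 1697610;   /* MMIO row: count and sum */
    double total = 8309940;               /* "sum of time" */

    printf("avg = %.4f usec\n", sum / count);      /* -> 180.5584 */
    printf("pct = %.2f%%\n", 100.0 * sum / total); /* -> 20.43 */
    return 0;
}
}}}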
boot guest
sum of time 8309940
type count min max sum avg stddev %
MMIO: 9402 44 1997 1697610 180.5584 155.768 20.43
DCR: 680 41 99 32096 47.2000 7.008 0.39
SIGNAL: 1 98 98 98 98.0000 0.000 0.00
ITLBREAL: 926 8 14 7810 8.4341 0.658 0.09
ITLBVIRT: 3595 18 202 76185 21.1919 4.954 0.92
DTLBREAL: 950 8 16 8891 9.3589 1.427 0.11
DTLBVIRT: 6695 18 282 156727 23.4096 13.781 1.89
SYSCALL: 1801 6 59 11372 6.3143 2.575 0.14
ISI: 116 6 8 764 6.5862 0.588 0.01
DSI: 43 6 7 292 6.7907 0.407 0.00
EMULINST: 65247 7 96 484081 7.4192 1.818 5.83
EMUL_WAIT: 801 659 9200 3721789 4646.4282 1687.218 44.79
EMUL_CORE: 66806 7 86 540053 8.0839 1.895 6.50
EMUL_MTSPR: 13415 8 62 111358 8.3010 2.583 1.34
EMUL_MFSPR: 7635 8 61 62772 8.2216 1.921 0.76
EMUL_MTMSR: 5678 8 59 45704 8.0493 1.434 0.55
EMUL_MFMSR: 32853 7 67 267603 8.1455 1.875 3.22
EMUL_TLBSX: 354 9 60 3745 10.5791 3.919 0.05
EMUL_TLBWE: 6403 9 112 99522 15.5430 7.668 1.20
EMUL_RFI: 9515 7 57 71420 7.5060 2.108 0.86
DEC: 290 49 161 15786 54.4345 9.708 0.19
EXTINT: 7 74 75 522 74.5714 0.495 0.01
TIMEINGUEST: 233213 0 3954 893740 3.8323 65.837 10.76
Hashes
sum of time 21576367
type count min max sum avg stddev %
MMIO: 827 49 6700 224632 271.6227 259.231 1.04
DCR: 161 42 94 7468 46.3851 4.314 0.03
SIGNAL: 2 291 1214 1505 752.5000 461.500 0.01
ITLBREAL: 53 8 12 445 8.3962 0.682 0.00
ITLBVIRT: 216 19 68 4566 21.1389 3.346 0.02
DTLBREAL: 44 9 16 420 9.5455 1.738 0.00
DTLBVIRT: 407 19 73 8706 21.3907 3.687 0.04
SYSCALL: 66 6 7 428 6.4848 0.500 0.00
ISI: 5 6 8 34 6.8000 0.748 0.00
DSI: 1 7 7 7 7.0000 0.000 0.00
EMULINST: 67009 6 97 508311 7.5857 1.247 2.36
EMUL_WAIT: 231 1254 8902 1074304 4650.6667 1699.150 4.98
EMUL_CORE: 32964 7 59 262866 7.9743 0.622 1.22
EMUL_MTSPR: 9201 8 14 74751 8.1242 0.339 0.35
EMUL_MFSPR: 379 8 60 3134 8.2691 2.686 0.01
EMUL_MTMSR: 4749 8 9 37996 8.0008 0.029 0.18
EMUL_MFMSR: 14257 7 55 114282 8.0159 0.776 0.53
EMUL_TLBSX: 18 9 15 185 10.2778 1.193 0.00
EMUL_TLBWE: 393 9 69 5477 13.9364 6.905 0.03
EMUL_RFI: 9006 7 57 67271 7.4696 0.722 0.31
DEC: 5065 49 269 267567 52.8267 13.544 1.24
EXTINT: 2 77 451 528 264.0000 187.000 0.00
TIMEINGUEST: 145056 0 3954 18911484 130.3737 678.943 87.65
Lame
sum of time 6592939
type count min max sum avg stddev %
MMIO: 1827 48 18883 550073 301.0799 772.936 8.34
DCR: 392 42 1074 22162 56.5357 83.884 0.34
SIGNAL: 1 1812 1812 1812 1812.0000 0.000 0.03
ITLBREAL: 142 8 13 1200 8.4507 0.623 0.02
ITLBVIRT: 1860 18 118 39336 21.1484 4.514 0.60
DTLBREAL: 164 8 66 1707 10.4085 4.885 0.03
DTLBVIRT: 2724 18 1039 63063 23.1509 23.705 0.96
SYSCALL: 255 6 8 1626 6.3765 0.531 0.02
ISI: 17 6 8 114 6.7059 0.570 0.00
DSI: 1 7 7 7 7.0000 0.000 0.00
EMULINST: 26682 6 161 203151 7.6138 2.885 3.08
EMUL_WAIT: 261 501 7683 1211114 4640.2835 1683.305 18.37
EMUL_CORE: 19247 7 161 158089 8.2137 4.052 2.40
EMUL_MTSPR: 4309 8 161 35697 8.2843 2.371 0.54
EMUL_MFSPR: 1238 8 61 10395 8.3966 3.510 0.16
EMUL_MTMSR: 2098 8 61 17045 8.1244 2.403 0.26
EMUL_MFMSR: 9158 7 163 75613 8.2565 3.392 1.15
EMUL_TLBSX: 36 9 13 364 10.1111 0.698 0.01
EMUL_TLBWE: 1062 9 75 17011 16.0179 7.539 0.26
EMUL_RFI: 3736 7 142 28230 7.5562 2.974 0.43
DEC: 1109 49 263 61117 55.1100 15.174 0.93
EXTINT: 36 52 1377 12085 335.6944 308.306 0.18
TIMEINGUEST: 76355 0 3954 4081928 53.4599 415.345 61.91
Fhourstone
sum of time 7483768
type count min max sum avg stddev %
MMIO: 818 47 9565 221609 270.9156 501.827 2.96
DCR: 301 40 473 14408 47.8671 29.403 0.19
SIGNAL: 1 2521 2521 2521 2521.0000 0.000 0.03
ITLBREAL: 322 8 58 2665 8.2764 2.810 0.04
ITLBVIRT: 5773 18 1360 120111 20.8056 18.569 1.60
DTLBREAL: 16433 8 73 184196 11.2089 3.709 2.46
DTLBVIRT: 19913 18 1845 500349 25.1268 23.006 6.69
SYSCALL: 91 6 7 579 6.3626 0.481 0.01
ISI: 5 6 8 33 6.6000 0.800 0.00
DSI: 1 7 7 7 7.0000 0.000 0.00
EMULINST: 127113 6 102 949687 7.4712 2.244 12.69
EMUL_WAIT: 76 3526 7578 354928 4670.1053 1679.367 4.74
EMUL_CORE: 16733 7 159 134306 8.0264 1.701 1.79
EMUL_MTSPR: 71886 8 149 594618 8.2717 2.595 7.95
EMUL_MFSPR: 83877 8 93 689016 8.2146 2.580 9.21
EMUL_MTMSR: 3166 8 57 25438 8.0347 1.220 0.34
EMUL_MFMSR: 7392 7 61 59415 8.0377 1.294 0.79
EMUL_TLBSX: 19 9 11 192 10.1053 0.640 0.00
EMUL_TLBWE: 47187 9 2757 1131527 23.9796 13.359 15.12
EMUL_RFI: 21782 7 132 172428 7.9161 2.323 2.30
DEC: 715 50 224 42155 58.9580 15.052 0.56
EXTINT: 2 142 161 303 151.5000 9.500 0.00
TIMEINGUEST: 423606 0 499 2283277 5.3901 25.629 30.51
find
sum of time 3426052
type count min max sum avg stddev %
MMIO: 222 49 413 48228 217.2432 141.096 1.41
DCR: 91 43 93 4239 46.5824 5.285 0.12
SIGNAL: 3 476 5651 7952 2650.6667 2191.871 0.23
ITLBREAL: 77 8 13 665 8.6364 0.754 0.02
ITLBVIRT: 1341 18 120 28340 21.1335 4.968 0.83
DTLBREAL: 59 8 16 573 9.7119 2.042 0.02
DTLBVIRT: 2253 19 214 48630 21.5846 7.083 1.42
SYSCALL: 4590 6 57 29503 6.4277 2.114 0.86
ISI: 11 6 8 72 6.5455 0.656 0.00
DSI: 1 7 7 7 7.0000 0.000 0.00
EMULINST: 71560 6 77 525976 7.3501 1.945 15.35
EMUL_WAIT: 374 184 9384 1724701 4611.5000 1752.946 50.34
EMUL_CORE: 32646 7 100 262449 8.0392 1.792 7.66
EMUL_MTSPR: 6668 8 78 54829 8.2227 2.213 1.60
EMUL_MFSPR: 538 8 61 4507 8.3773 3.181 0.13
EMUL_MTMSR: 5829 8 58 47036 8.0693 1.765 1.37
EMUL_MFMSR: 15805 7 92 127426 8.0624 1.774 3.72
EMUL_TLBSX: 29 9 14 297 10.2414 0.857 0.01
EMUL_TLBWE: 462 9 27 6717 14.5390 6.606 0.20
EMUL_RFI: 10855 7 57 81776 7.5335 2.132 2.39
DEC: 160 50 403 9244 57.7750 32.861 0.27
EXTINT: 4 427 1410 2991 747.7500 387.275 0.09
TIMEINGUEST: 153578 0 762 409894 2.6690 6.322 11.96
Paravirtualization improvement
As mentioned above, improvements for all these overhead statistics are already known: on one hand the hardware virtualization support specified in Power ISA 2.06, on the other hand paravirtualization on older hardware. For kvmppc we wrote a simple paravirtualization interface to test hypercalls and measure the benefit of such implementations. In the concept measured here, the hypervisor tells the guest that it supports special paravirtualization features; the guest then makes a hypercall passing the guest virtual and guest physical address of a piece of RAM (4096 bytes in the example). This is a very basic but flexible interface, as the hypervisor can now use this guest-addressable memory for all kinds of things. In the example, the hypervisor rewrites guest code to replace the privileged instructions mfspr/mtspr (for SPRG1-3, ESR, DEAR) and mfmsr. The hypervisor keeps the guest copies of these values up to date in the memory area provided by the guest and rewrites the privileged instructions to simple non-trapping load/store instructions. This saves a lot of EMULINST exits while running the guest and has shown around 35-50% improvement for workloads with a high amount of emulated instructions (e.g. the boot workload).
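To illustrate the rewriting idea, the sketch below turns a trapping mfmsr into a plain load from the donated page. The register holding the page address and the offset of the MSR copy are hypothetical, and this is not the real kvmppc patching code:
{{{
#include <stdint.h>
#include <stdio.h>

#define INST_MFMSR  0x7c0000a6u  /* mfmsr rD; rD in bits 21-25 */
#define OP_LWZ      32u          /* lwz rD, d(rA) */
#define SHARED_REG  30           /* hypothetical: rA holds the page address */
#define MSR_OFFSET  0x00         /* hypothetical slot of the MSR copy */

/* Rewrite "mfmsr rD" into "lwz rD, MSR_OFFSET(r30)": the guest now reads
 * the hypervisor-maintained MSR copy with a non-trapping load. */
static uint32_t rewrite_mfmsr(uint32_t inst)
{
    uint32_t rd = (inst >> 21) & 0x1f;

    return (OP_LWZ << 26) | (rd << 21) | (SHARED_REG << 16)
           | (MSR_OFFSET & 0xffff);
}

int main(void)
{
    uint32_t mfmsr_r5 = INST_MFMSR | (5u << 21);

    printf("0x%08x -> 0x%08x\n", (unsigned)mfmsr_r5,
           (unsigned)rewrite_mfmsr(mfmsr_r5));  /* 0x7ca000a6 -> 0x80be0000 */
    return 0;
}
}}}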
The net benefit of this improvement is high, while the "guest invasiveness" is very low (the guest only has to donate a small amount of RAM and virtual address space; everything else can be done transparently by the hypervisor). And remember that this is just one simple example of PV extensions; there are many other areas, e.g. collaborative guest/host TLB management, that could improve performance significantly (this could be as simple as telling the guest to program more virtual TLB entries into the guest TLB, allowing the host to fix more misses directly).
The following numbers show the improvement when comparing the same workload with and without this paravirtualization feature. The workload used in this test is the boot workload, but with a more complex guest and host environment (Ubuntu instead of ELDK); it also uses an older version of our kernel and userspace code (containing an older version of the exit timing code, so the numbers are not directly comparable with the measurements above).
No paravirtualization
sum of time 144837890 => ~2:24 runtime
type count min max sum avg stddev %
MMIO 10105 46 1534 2952467 292.17 295.933 2.04
DCR 5428 40 209 246168 45.35 6.758 0.17
SIGNAL 695 65 3755 89767 129.16 314.421 0.06
ITLBREAL 80045 13 108 1063127 13.28 2.338 0.73
ITLBVIRT 1000585 21 264827 24300596 24.28 264.753 16.78
DTLBREAL 91206 13 69 1285214 14.09 2.225 0.89
DTLBVIRT 977434 21 1446 24007008 24.56 4.426 16.58
SYSCALL 10460 11 55 116447 11.13 1.929 0.08
ISI 11724 11 61 130007 11.08 1.929 0.09
DSI 20737 11 57 230009 11.09 1.914 0.16
EMULINST 5683356 11 3778 79339467 13.96 50.275 54.78
DEC 13079 50 826 732712 56.02 22.382 0.51
EXTINT 55 30 1478 10996 199.92 238.150 0.01
FP_UNAVAIL 280 11 53 3163 11.29 3.495 0.00
TIMEINGUEST 7905189 0 3688 10330742 1.30 8.970 7.13
Paravirtualization
sum of time 92206510 => ~1:32 runtime (~37% net improvement)
type count min max sum avg stddev %
MMIO 12505 46 3087 3693782 295.38 260.788 4.01
DCR 5595 40 706 273578 48.89 31.305 0.30
SIGNAL 654 65 4132 300027 458.75 571.130 0.33
ITLBREAL 71711 13 104 943053 13.15 2.360 1.02
ITLBVIRT 750649 21 1503 18178245 24.21 7.335 19.71
DTLBREAL 83356 13 102 1146242 13.75 2.406 1.24
DTLBPV 30086 20 237 653556 21.72 4.639 0.71
DTLBVIRT 772811 21 713 19079477 24.68 6.593 20.69
SYSCALL 7647 11 57 84821 11.09 1.897 0.09
HCALL 1 19 19 19 19.00 0.000 0.00
ISI 9895 11 73 109667 11.08 1.904 0.12
DSI 17974 10 57 199504 11.09 2.046 0.22
EMULINST 2567245 11 4212 40501314 15.77 65.673 43.92
DEC 7488 51 641 426813 56.99 23.893 0.46
EXTINT 2215 31 1677 297495 134.30 116.219 0.32
FP_UNAVAIL 258 11 11 2838 11.00 0.000 0.00
TIMEINGUEST 4340090 0 3850 6316079 1.45 12.599 6.85
More Results
This page is meant only as an overview and is already huge; more results can be found in the various timing and improvement discussions on kvm-ppc@vger.kernel.org.