<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://linux-kvm.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Stefanha</id>
	<title>KVM - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://linux-kvm.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Stefanha"/>
	<link rel="alternate" type="text/html" href="https://linux-kvm.org/page/Special:Contributions/Stefanha"/>
	<updated>2026-04-21T13:44:22Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.5</generator>
	<entry>
		<id>https://linux-kvm.org/index.php?title=FAQ&amp;diff=4851</id>
		<title>FAQ</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=FAQ&amp;diff=4851"/>
		<updated>2013-07-30T12:21:21Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=FAQ=&lt;br /&gt;
&lt;br /&gt;
== Preparing to use KVM ==&lt;br /&gt;
=== What do I need to use KVM? ===&lt;br /&gt;
You will need an x86 machine running a recent Linux kernel on an Intel processor with VT (virtualization technology) extensions, or an AMD processor with SVM extensions (also called AMD-V). Xen has a [http://wiki.xensource.com/xenwiki/HVM_Compatible_Processors complete list] of compatible processors. For Intel processors, see also [http://ark.intel.com/VTList.aspx the Intel® Virtualization Technology List].&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
=== Are 64-bit processors supported under KVM? ===&lt;br /&gt;
Yes, they are supported and allow you to run both 32-bit and 64-bit guests.&lt;br /&gt;
&lt;br /&gt;
See also &#039;&#039;&#039;Can KVM run a 32-bit guest on a 64-bit host? What about PAE?&#039;&#039;&#039; below.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What is Intel VT / AMD-V / hvm? ===&lt;br /&gt;
[http://www.intel.com/technology/itj/2006/v10i3/1-hardware/6-vt-x-vt-i-solutions.htm Intel VT] and [http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8826_14287,00.html AMD&#039;s AMD-V] are instruction set extensions that provide hardware assistance to virtual machine monitors. They enable running fully isolated virtual machines at native hardware speeds, for some workloads.&lt;br /&gt;
&lt;br /&gt;
HVM (for Hardware Virtual Machine) is a vendor-neutral term often used to designate the x86 instruction set extensions.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Where do I get my kvm kernel modules from? ===&lt;br /&gt;
&lt;br /&gt;
See the [[Getting the kvm kernel modules]] page.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== How can I tell if I have Intel VT or AMD-V? ===&lt;br /&gt;
With a recent enough Linux kernel, run the command:&lt;br /&gt;
&lt;br /&gt;
 egrep &#039;^flags.*(vmx|svm)&#039; /proc/cpuinfo&lt;br /&gt;
&lt;br /&gt;
If something shows up, you have VT. You can also check the processor model name (in `/proc/cpuinfo`) on the vendor&#039;s web site.&lt;br /&gt;
&lt;br /&gt;
Note:&lt;br /&gt;
&lt;br /&gt;
*  You&#039;ll never see (vmx|svm) in /proc/cpuinfo if you&#039;re currently running in a Xen dom0 or domU. The Xen hypervisor suppresses these flags in order to prevent hijacking.&lt;br /&gt;
* Some manufacturers disable VT in the machine&#039;s BIOS, in such a way that it cannot be re-enabled.&lt;br /&gt;
* `/proc/cpuinfo` only shows virtualization capabilities starting with Linux 2.6.15 (Intel) and Linux 2.6.16 (AMD). Use the `uname -r` command to query your kernel version.&lt;br /&gt;
In case of doubt, contact your hardware vendor.&lt;br /&gt;
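The steps above can be rolled into a small shell sketch (assumes a Linux host with /proc mounted; the variable name is mine):

```shell
# Sketch: report whether the CPU advertises VT (vmx) or AMD-V (svm).
# Prints "absent" if /proc/cpuinfo is missing or has no such flags.
if grep -Eq '(vmx|svm)' /proc/cpuinfo 2>/dev/null; then
  virt_status=available
else
  virt_status=absent
fi
echo "hardware virtualization: $virt_status"
```

Remember the caveats above: a missing flag may mean Xen is hiding it or the BIOS has it disabled, not that the CPU lacks it.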
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== &amp;quot;KVM: disabled by BIOS&amp;quot; error ===&lt;br /&gt;
Check if there is an option to enable it in the BIOS. If not, look for a more recent BIOS on the vendor&#039;s web site.&lt;br /&gt;
&lt;br /&gt;
Note:&lt;br /&gt;
&lt;br /&gt;
* On some hardware (e.g. HP nx6320), you need to power the machine off and on again after enabling virtualization in the BIOS.&lt;br /&gt;
* Enabling some BIOS features may break VT support on some hardware (e.g. enabling Intel AMT on a ThinkPad T500 will prevent kvm-intel from loading with &amp;quot;disabled by bios&amp;quot;).&lt;br /&gt;
* On some Dell hardware, you also need to disable &amp;quot;Trusted Execution&amp;quot;, otherwise VT will not be enabled.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== How can I use AMD-V extension? ===&lt;br /&gt;
 modprobe kvm-amd&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What user space tools does KVM use? ===&lt;br /&gt;
KVM uses a slightly modified [http://www.nongnu.org/qemu QEMU] program to instantiate the virtual machine. Once running, a virtual machine is just a regular process. You can use `top(1), kill(1), taskset(1)` and similar tools to manage virtual machines.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What virtual disk formats can KVM use? ===&lt;br /&gt;
KVM inherits a wealth of disk formats support from QEMU; it supports raw images, the native QEMU format (qcow2), VMware format, and many more.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== How do I use KVM on a headless machine (without a local GUI)? ===&lt;br /&gt;
Install a management tool such as virt-manager on a remote machine.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Are there management tools available to help me manage my virtual machines? ===&lt;br /&gt;
Yes.  Please see the [[Management Tools]] page for some links.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
== Using KVM ==&lt;br /&gt;
=== How can I use KVM with a non-privileged user? ===&lt;br /&gt;
The cleanest way is probably to create a group, say &#039;&#039;kvm&#039;&#039;, and add the user(s) to that group. Then you will need to change /dev/kvm to be owned by group &#039;&#039;kvm&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
On a system that runs udev, you will probably need to add the following line somewhere in your udev configuration so that the newly created device automatically gets the right group (e.g. on Ubuntu, add a line to &#039;&#039;/etc/udev/rules.d/40-permissions.rules&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
 KERNEL==&amp;quot;kvm&amp;quot;, GROUP=&amp;quot;kvm&amp;quot;&lt;br /&gt;
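A minimal one-time setup sketch, assuming root privileges; the user name is a placeholder and exact commands vary by distribution:

```shell
# Create the group, add a user to it, and apply it to the existing device node.
# 'alice' is a placeholder user name; the udev rule above makes the group
# assignment persistent across reboots.
groupadd kvm
usermod -a -G kvm alice
chgrp kvm /dev/kvm
chmod g+rw /dev/kvm
```

The user must log out and back in for the new group membership to take effect.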
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== How can I get the most performance out of KVM? ===&lt;br /&gt;
&lt;br /&gt;
See the [[Tuning KVM]] page.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Is KVM stable? ===&lt;br /&gt;
KVM is stable and used in production.  As with most open source projects, development snapshots are less stable than the stable release series.&lt;br /&gt;
&lt;br /&gt;
If your name is Andreas Mohr, you&#039;re reporting bugs in the wrong place.&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== That&#039;s alright, but can I really use it for my daily use? ===&lt;br /&gt;
Sure. We continuously run the most commonly used OSes and configurations, and anything that breaks for the developers is fixed promptly. See the [[Guest Support Status]] and [[Host Support Status]] pages to find out more. Please update them with success stories so that new users can benefit from the experience of the community.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== How about production use? ===&lt;br /&gt;
For production use, it&#039;s recommended that you use the KVM modules shipped by your distribution to ensure stability. As mentioned above, it&#039;s tempting to use new features, but they may hide unwanted surprises. Ideally, run the development snapshots under a non-critical production load, so that the latest releases are already proven stable for you when you decide to deploy them.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What happens if I kill -9 a VM process? ===&lt;br /&gt;
From the guest&#039;s perspective, it is as if you yanked the power cord out. From the host&#039;s perspective, the process is killed and all resources it uses are reclaimed.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== I need help to setup the network for my guest ===&lt;br /&gt;
Have a look at the [[Networking]] page of this wiki for information on the most common networking setups for guests. You can also refer to the QEMU documentation.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Where can I find more documentation? ===&lt;br /&gt;
Most usability issues are covered in the QEMU [http://www.nongnu.org/qemu/user-doc.html documentation].  There is also an extensive [http://qemu-buch.de/cgi-bin/moin.cgi/FrequentlyAskedQuestions FAQ] (old vanished link: [http://kidsquid.com/cgi-bin/moin.cgi/FrequentlyAskedQuestions FAQ]).&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
=== How can I check that I&#039;m not falling back to QEMU with no hardware acceleration? ===&lt;br /&gt;
&lt;br /&gt;
If you think that you might not be using the hardware acceleration provided by the KVM module, here are a few steps to help you check this.&lt;br /&gt;
&lt;br /&gt;
First of all, check that you don&#039;t have messages such as:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 qemu-system-x86_64 -hda myvm.qcow2&lt;br /&gt;
 open /dev/kvm: No such file or directory&lt;br /&gt;
 Could not initialize KVM, will disable KVM support&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In that case, you can check that:&lt;br /&gt;
* the modules are correctly loaded: &amp;lt;code&amp;gt;lsmod | grep kvm&amp;lt;/code&amp;gt;&lt;br /&gt;
* you don&#039;t have a &amp;quot;KVM: disabled by BIOS&amp;quot; line in the output of dmesg&lt;br /&gt;
* /dev/kvm exists and you have the correct rights to use it&lt;br /&gt;
&lt;br /&gt;
Other ways to do the diagnostic:&lt;br /&gt;
* if you have access to the QEMU monitor (Ctrl-Alt-2, use Ctrl-Alt-1 to get back to the VM display), enter the &amp;quot;info kvm&amp;quot; command and it should respond with &amp;quot;KVM support: enabled&amp;quot;&lt;br /&gt;
* the right-hand columns of the output of &amp;lt;code&amp;gt;lsmod | grep kvm&amp;lt;/code&amp;gt; on the host system, once the VM is started, should show only non-zero values. The value on the line for the architecture-specific module (e.g. kvm_intel, kvm_amd) shows the number of VMs using the module. For instance, with two VMs running using the KVM module on a machine with VT, it will report:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 lsmod|grep kvm&lt;br /&gt;
 kvm_intel              44896  2&lt;br /&gt;
 kvm                   159656  1 kvm_intel&lt;br /&gt;
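The count can also be pulled out programmatically; a sketch against the sample output above (the column position follows lsmod&#039;s fixed layout, and the sample line is captured data, not a live query):

```shell
# The third column of the kvm_intel line is the number of VMs using the module.
sample='kvm_intel              44896  2'
vm_count=$(echo "$sample" | awk '{print $3}')
echo "VMs using kvm_intel: $vm_count"
```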
&lt;br /&gt;
=== &amp;quot;rect too big&amp;quot; Message when using VNC Display ===&lt;br /&gt;
When connecting to a VNC terminal, a &amp;quot;rect too big&amp;quot; message appears and the VNC session disconnects.&lt;br /&gt;
&lt;br /&gt;
This happens because of a VNC protocol flaw in the way on-the-fly pixel format changes are handled (more info in [http://www.mail-archive.com/qemu-devel@nongnu.org/msg04879.html this thread]). If you are using TigerVNC, you can avoid this problem by disabling on-the-fly selection of pixel encoding with the &amp;lt;tt&amp;gt;-AutoSelect=0&amp;lt;/tt&amp;gt; command-line option of vncviewer. Since this also disables automatic selection of encoding based on connection speed, you may want to check the encoding options on the vncviewer man page.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
=== How do I set up the network so that my guest is accessible from other machines? My guest network is stuck, what should I do? ===&lt;br /&gt;
KVM uses QEMU for its device emulation. Consult the [http://qemu-project.org/Documentation/Networking QEMU network wiki page] for detailed network setup instructions.&lt;br /&gt;
&lt;br /&gt;
You will probably be interested in the Root Networking Mode page and the Network Bridge page.&lt;br /&gt;
&lt;br /&gt;
Guest-side network lockups (fortunately restartable) may happen due to erroneous MAC address reconfiguration by tun/tap bridging on the host side; see RHEL bug #571991 and others.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== I&#039;m experiencing timer drift issues in my VM guests, what to do? ===&lt;br /&gt;
&lt;br /&gt;
Especially for networked systems (e.g. via NFS or Samba), it is very important to ensure stable timing (both the system timer and the RTC).&lt;br /&gt;
Tell-tale signs of related trouble in VMs (qemu/KVM/VMware etc. all appear to be affected) include:&lt;br /&gt;
&amp;quot;make[2]: Warning: File `XXXXX/cmakelists_rebuilder.stamp&#039; has modification time 0.37 s in the future&amp;quot;&lt;br /&gt;
&amp;quot;Clock skew detected. Your build may be incomplete.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
[http://maemovmware.garage.maemo.org/2nd_edition/requirements_documentation.html Maemo docs] state that it&#039;s important to disable UTC and set the correct time zone; however, it is not clear how that would help with diverging host/guest clocks.&lt;br /&gt;
Much more useful and important is to configure a properly working NTP setup (chrony recommended, or ntpd) on both host and guest.&lt;br /&gt;
The single most decisive trick is to specify the &#039;&#039;&#039;host&#039;&#039;&#039; as the main NTP server within the guest VM, instead of &amp;quot;foreign&amp;quot; NTP servers. This achieves the most precise coupling between the two systems (timing drift against other systems matters far less than tight time precision for host/guest interaction, e.g. with NFS/Samba shares).&lt;br /&gt;
For verification, see the chronyc &amp;quot;sources -v&amp;quot; and &amp;quot;tracking&amp;quot; (&amp;quot;System time&amp;quot; row) commands.&lt;br /&gt;
&lt;br /&gt;
Applying this very tight NTP coupling seems to finally get rid of make&#039;s time drift warnings.&lt;br /&gt;
&lt;br /&gt;
Perhaps qemu&#039;s -tdf (timing drift fix) option magically manages to help in your case, too.&lt;br /&gt;
&lt;br /&gt;
See also [https://espace.cern.ch/it-faqs/Lists/faqs/DispForm.aspx?ID=368 Faqs: I received a message about &amp;quot;clock skew&amp;quot;].&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
=== I get &amp;quot;rtc interrupts lost&amp;quot; messages, and the guest is very slow. What can I do? ===&lt;br /&gt;
Try setting &amp;lt;code&amp;gt;CONFIG_HPET_EMULATE_RTC=y&amp;lt;/code&amp;gt; in your host &amp;lt;code&amp;gt;.config&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== I get an &amp;quot;Exception 13&amp;quot; or &amp;quot;Exception 12&amp;quot; message while booting a guest OS on my Intel host ===&lt;br /&gt;
See the [[Intel Real Mode Emulation Problems]] page.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== I have VMware/Parallels/VirtualBox installed and when I modprobe KVM, my system deadlocks. ===&lt;br /&gt;
Neither Intel VT nor AMD-V provide a mechanism to determine whether software is currently using the hardware virtualization extensions.  This means that if you have two kernel modules loaded attempting to use hardware virtualization extensions, very bad things will happen.  If you are using another type of virtualization software and experience any sort of weirdness with KVM, make sure you can reproduce the problem without the kernel modules for that software loaded before you report a bug in KVM.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== There&#039;s nothing on the QEMU/KVM screen, but it&#039;s not hung! I&#039;m trying to install Kubuntu. ===&lt;br /&gt;
Try running kvm with the -std-vga option. It helps if the guest operating system uses a framebuffer mode, as Kubuntu/Ubuntu do.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== When I click the guest operating system window, the mouse is grabbed. How can I stop it from doing that? OR The mouse doesn&#039;t show up / doesn&#039;t work in the guest. What do I do? ===&lt;br /&gt;
From the #qemu wiki, try running kvm/qemu with&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 -usb -usbdevice tablet&lt;br /&gt;
If that doesn&#039;t work, try this:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 $ export SDL_VIDEO_X11_DGAMOUSE=0&lt;br /&gt;
(from http://wiki.clug.org.za/wiki/QEMU_mouse_not_working )&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
== General KVM information ==&lt;br /&gt;
=== What is the difference between KVM and Xen? ===&lt;br /&gt;
Xen is an external hypervisor; it assumes control of the machine and divides resources among guests. KVM, on the other hand, is part of Linux and uses the regular Linux scheduler and memory management. This means that KVM is much smaller and simpler to use; it is also more featureful: for example, KVM can swap guests to disk in order to free RAM.&lt;br /&gt;
&lt;br /&gt;
KVM only runs on processors that support x86 HVM (the VT/SVM instruction sets), whereas Xen also allows running modified operating systems on non-HVM x86 processors using a technique called paravirtualization. KVM does not support paravirtualization of the CPU, but may use paravirtualized device drivers to improve I/O performance.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What is the difference between KVM and VMware? ===&lt;br /&gt;
VMware is a proprietary product. KVM is Free Software released under the GPL.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What is the difference between KVM and QEMU? ===&lt;br /&gt;
QEMU uses emulation; KVM uses processor extensions (HVM) for virtualization.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Do you have a port of KVM for Windows? ===&lt;br /&gt;
Not officially.&lt;br /&gt;
&lt;br /&gt;
Kazushi Takahashi has been working on an experimental version though, called WinKVM, available [http://github.com/ddk50/winkvm here].&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What kernel version does it work with? ===&lt;br /&gt;
It depends on what version of KVM you are using. The latest release of KVM should work with any recent kernel (2.6.17 and above); older releases work with even older kernels.&lt;br /&gt;
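To check a kernel version against that minimum, you can compare version strings with sort -V (GNU coreutils); the values here are illustrative:

```shell
# Compare two kernel versions: sort -V orders them numerically,
# so the first line of the sorted output is the older of the two.
kver=2.6.32.3
min=2.6.17
oldest=$(printf '%s\n%s\n' "$kver" "$min" | sort -V | head -n1)
echo "oldest of the pair: $oldest"
```

If the minimum comes out as the oldest, your kernel is recent enough (substitute the output of `uname -r` for the first value on a real host).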
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== How much RAM do I need? ===&lt;br /&gt;
You will need enough memory to let the guest run comfortably while keeping enough for the host. 1GB is probably a minimum configuration for the host OS.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Is dynamic memory management for guests supported? ===&lt;br /&gt;
This is a broad topic covering a few areas.&lt;br /&gt;
&lt;br /&gt;
A. KVM only allocates memory as the guest tries to use it. Once it&#039;s allocated, KVM keeps it. Some guests (namely Microsoft guests) zero all memory at boot time. So they will use all memory.&lt;br /&gt;
&lt;br /&gt;
B. Certain guests (only Linux at the moment) have a balloon driver, so the host can have the guest allocate a certain amount of memory which the guest won&#039;t be able to use anymore and it can then be freed on the host. Ballooning is controlled in the host via the [http://www.nongnu.org/qemu/qemu-doc.html#SEC12 balloon monitor command].&lt;br /&gt;
&lt;br /&gt;
C. Some hosts (presently only RHEL5.4 / CentOS 5.4) have a feature called KSM (Kernel Samepage Merging), which collapses identical pages together; this requires kernel support on the host, as well as a kvm new enough to opt in to the behavior. As some guest platforms (most notably Windows) zero out freed memory, such pages are trivially collapsed. The ksmctl command needs to be used to enable KSM; alternatively, the ksmtuned service found in Fedora 12 can be run to dynamically adjust KSM&#039;s aggressiveness based on the amount of free memory available.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== What OSs can I run inside KVM VM? ===&lt;br /&gt;
Several.  See the [[Guest Support Status]] page for details. Note that several Linux flavors are known to hang on Intel processors during startup. The workaround is to disable splash screens in GRUB.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Does KVM support a live migration feature to move virtual machines from one host to another without downtime? ===&lt;br /&gt;
Yes.  See the [[Migration]] page for details.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Does KVM support live migration from an AMD host to an Intel host and back? ===&lt;br /&gt;
Yes.  There may be issues on 32-bit Intel hosts which don&#039;t support NX (or XD), but for 64-bit hosts back and forth migration should work well. Migration of 32-bit guests should work between 32-bit hosts and 64-bit hosts.&lt;br /&gt;
If one of your hosts does not support NX, you may consider disabling NX when starting the guest on an NX-capable system. You can do this by passing the &amp;quot;-cpu qemu64,-nx&amp;quot; parameter to the guest.&lt;br /&gt;
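For illustration, a full invocation might look like this (guest.img is a placeholder disk image; only the -cpu qemu64,-nx part comes from the text above):

```shell
# Mask the NX flag so the guest can later migrate to a host without NX support.
qemu_args="-cpu qemu64,-nx -hda guest.img"
echo "qemu-system-x86_64 $qemu_args"
```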
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Can KVM run a 32-bit guest on a 64-bit host? What about PAE? ===&lt;br /&gt;
KVM supports 32-bit guests on 64-bit hosts, and any combination of PAE and non-PAE guests and hosts. The only unsupported combination is a 64-bit guest on a 32-bit host.&lt;br /&gt;
&lt;br /&gt;
If you are running a Windows Virtual Machine and have problems enabling PAE in your guest see the [[Windows PAE Workaround]] page.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Is it possible to use USB devices with a guest OS? ===&lt;br /&gt;
Yes; it works the same way as in QEMU, so look up how to do it there.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Can I have higher or widescreen resolutions (eg 1680 x 1050) in KVM? ===&lt;br /&gt;
Use the -vga std parameter while starting the VM to allow high resolution and widescreen displays.&lt;br /&gt;
&lt;br /&gt;
If the resolution you want to use is not available, you can patch the corresponding source files (see http://article.gmane.org/gmane.comp.emulators.kvm.devel/13557 as a reference), or send a mail to the KVM mailing list if you are not able to patch the source yourself.&lt;br /&gt;
&lt;br /&gt;
When using Windows as the guest OS (and assuming you have no issue with people violating the GPL), you might want to use the driver from the VBEMP x86 project (http://www.bearwindows.boot-land.net/vbemp.htm), which is based on ReactOS code.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Does KVM support SMP hosts? ===&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Does KVM support SMP guests? ===&lt;br /&gt;
Yes. Up to 16 CPUs can be specified using the -smp option.&lt;br /&gt;
&lt;br /&gt;
----------&lt;br /&gt;
&lt;br /&gt;
=== Is the name &#039;KVM&#039; trademarked? ===&lt;br /&gt;
No.&lt;br /&gt;
&lt;br /&gt;
----------&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Code&amp;diff=4692</id>
		<title>Code</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Code&amp;diff=4692"/>
		<updated>2013-04-22T12:28:46Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Code=&lt;br /&gt;
&lt;br /&gt;
[[Category:Architechture]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== kernel git tree ==&lt;br /&gt;
&lt;br /&gt;
The kvm kernel code is available through a git tree (like the kernel itself).  To clone the repository using git, type&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/virt/kvm/kvm.git&lt;br /&gt;
&lt;br /&gt;
Alternatively, it is also accessible through the kernel.org gitweb interface:               &lt;br /&gt;
[http://git.kernel.org/?p=virt/kvm/kvm.git;a=summary]&lt;br /&gt;
&lt;br /&gt;
For subsequent upgrades use the command&lt;br /&gt;
&lt;br /&gt;
 git pull&lt;br /&gt;
&lt;br /&gt;
in the git working directory.&lt;br /&gt;
&lt;br /&gt;
== kernel git workflow ==&lt;br /&gt;
&lt;br /&gt;
See [[Kvm-Git-Workflow]]&lt;br /&gt;
&lt;br /&gt;
== userspace git tree ==&lt;br /&gt;
&lt;br /&gt;
As of QEMU 1.3, the KVM userspace code is in mainline QEMU.  Please use and develop with&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.qemu-project.org/qemu.git&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;If you want to contribute code&#039;&#039;&#039;, please see the [http://wiki.qemu.org/Contribute/StartHere guidelines] and submit patches to qemu-devel@nongnu.org.&lt;br /&gt;
&lt;br /&gt;
The old qemu-kvm.git fork repository is still available but outdated; to clone it, type&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git&lt;br /&gt;
&lt;br /&gt;
Alternatively, it is also accessible through the kernel.org gitweb interface:         &lt;br /&gt;
[http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=summary]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== building an external module with older kernels ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This only works for the x86 architecture.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you wish to use a distribution kernel (or just some random kernel you like) with kvm,&lt;br /&gt;
you can use the external module kit.  You will need the kvm-kmod repository:&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kiszka.org/kvm-kmod.git&lt;br /&gt;
 cd kvm-kmod&lt;br /&gt;
 git submodule update --init&lt;br /&gt;
 ./configure [--kerneldir=/path/to/kernel/dir]&lt;br /&gt;
 make sync&lt;br /&gt;
 make&lt;br /&gt;
&lt;br /&gt;
=== Tip about building against Red Hat Enterprise Linux kernels ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;kvm-userspace/kernel&amp;lt;/code&amp;gt; has some compat code to allow it to compile against older kernels, and also some code specific to features that are normally not present on older kernels but are present on RHEL kernels.&lt;br /&gt;
&lt;br /&gt;
So, when building against a RHEL kernel tree, check if the &amp;lt;code&amp;gt;RHEL_*&amp;lt;/code&amp;gt; macros at &amp;lt;code&amp;gt;${kerneldir}/include/linux/version.h&amp;lt;/code&amp;gt; are defined correctly, corresponding to the RHEL version where the kernel source comes from. If those macros aren&#039;t defined correctly, the compat code that allows compilation against RHEL kernels will break and you will get build errors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== release tags ==&lt;br /&gt;
&lt;br /&gt;
kvm stable releases (based off of QEMU&#039;s stable branch) are tagged with &amp;lt;code&amp;gt;kvm-qemu-0.NN.N&amp;lt;/code&amp;gt;, where &#039;&#039;N&#039;&#039; corresponds to the upstream QEMU branch version. Note that kvm tags these releases rather than branching them.&lt;br /&gt;
&lt;br /&gt;
kvm development releases are tagged with &amp;lt;code&amp;gt;kvm-nn&amp;lt;/code&amp;gt; where &#039;&#039;nn&#039;&#039; is the release number.&lt;br /&gt;
&lt;br /&gt;
== Binary Packages ==&lt;br /&gt;
&lt;br /&gt;
=== CentOS / RHEL ===&lt;br /&gt;
&lt;br /&gt;
Unofficial packages of latest releases can be found at:&lt;br /&gt;
&amp;lt;code&amp;gt;http://www.lfarkas.org/linux/packages/centos/5/&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debian ===&lt;br /&gt;
&lt;br /&gt;
For Debian Lenny, please use packages from backports.debian.org for &amp;lt;em&amp;gt;both&amp;lt;/em&amp;gt; qemu-kvm and the kernel (at least 2.6.32).  It is important to use a more recent kernel; 2.6.26 does not work well with kvm.&lt;br /&gt;
&lt;br /&gt;
Note that the package &amp;quot;kvm&amp;quot; has been renamed to &amp;quot;qemu-kvm&amp;quot; in Squeeze and in Lenny backports (and kvm is now a transitional package that installs qemu-kvm automatically).&lt;br /&gt;
&lt;br /&gt;
Debian Squeeze will have qemu-kvm based on 0.12, available in standard repositories.&lt;br /&gt;
&lt;br /&gt;
Experimental 0.13 packages are available at &lt;br /&gt;
&amp;lt;code&amp;gt;http://www.corpit.ru/debian/tls/kvm/0.13/&amp;lt;/code&amp;gt;, pending upload to Debian experimental.&lt;br /&gt;
&lt;br /&gt;
== nightly snapshots ==&lt;br /&gt;
&lt;br /&gt;
Nightly snapshots, for those who are uncomfortable with git, are [http://people.qumranet.com/avi/snapshots available].  When reporting a problem with a snapshot, please quote the snapshot name (which includes the date) and the contents of the SOURCES file in the snapshot tarball.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Status&amp;diff=4691</id>
		<title>Status</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Status&amp;diff=4691"/>
		<updated>2013-04-22T12:20:55Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Status=&lt;br /&gt;
&lt;br /&gt;
KVM has been included in the mainline Linux kernel since 2.6.20 and is stable and fast for most workloads.&lt;br /&gt;
&lt;br /&gt;
The userspace tools are part of mainline QEMU since 1.3.&lt;br /&gt;
&lt;br /&gt;
It is also available as a patch for recent Linux kernel versions, and as an external module that can be used with your favorite distro-provided kernel as far back as 2.6.16, which covers the kernels of all current Enterprise Linux distributions.&lt;br /&gt;
&lt;br /&gt;
===Working:===&lt;br /&gt;
&lt;br /&gt;
* Intel-based hosts (requires VT capable processors)&lt;br /&gt;
* AMD-based hosts (requires SVM capable processors)&lt;br /&gt;
* Windows/Linux/Unix guests (32-bit and 64-bit)&lt;br /&gt;
* SMP hosts&lt;br /&gt;
* SMP guests (as of kvm-61, a maximum of 16 CPUs is supported)&lt;br /&gt;
* Live [[Migration]] of guests from one host to another (32-bit and 64-bit)&lt;br /&gt;
* See the [[Guest Support Status]] page for a list of guest operating systems known to work&lt;br /&gt;
* See the [[Host Support Status]] page for information on host hardware.&lt;br /&gt;
* Guest swapping&lt;br /&gt;
* [[Paravirtualized networking]]&lt;br /&gt;
* [[Paravirtualized block device]]&lt;br /&gt;
* [[How_to_assign_devices_with_VT-d_in_KVM|PCI-Express passthrough]]&lt;br /&gt;
&lt;br /&gt;
===In progress:===&lt;br /&gt;
&lt;br /&gt;
* [[PowerPC|PowerPC port]]&lt;br /&gt;
* IA64 port&lt;br /&gt;
* xenner (http://kraxel.fedorapeople.org/xenner), a project to run x86 xen guest (domU) kernels&lt;br /&gt;
* [http://systems.cs.columbia.edu/projects/kvm-arm/ ARM port]&lt;br /&gt;
* [[VGA_device_assignment|VGA device assignment]]&lt;br /&gt;
&lt;br /&gt;
===Related===&lt;br /&gt;
* [http://twit.tv/show/floss-weekly/229 FLOSS weekly KVM interview by Avi Kivity &amp;amp; Dor Laor]&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Downloads&amp;diff=4690</id>
		<title>Downloads</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Downloads&amp;diff=4690"/>
		<updated>2013-04-22T12:19:56Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Downloads=&lt;br /&gt;
&lt;br /&gt;
Most Linux distros already have KVM kernel modules and userspace tools available through their packaging systems. This is the easiest and recommended way of using KVM.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;b&amp;gt;KVM kernel modules&amp;lt;/b&amp;gt; are part of the Linux kernel package&lt;br /&gt;
* &amp;lt;b&amp;gt;Userspace tools&amp;lt;/b&amp;gt; are usually called &amp;quot;qemu-kvm&amp;quot; or &amp;quot;kvm&amp;quot;&lt;br /&gt;
* &amp;lt;b&amp;gt;Linux guest drivers&amp;lt;/b&amp;gt; are part of the Linux kernel package&lt;br /&gt;
* &amp;lt;b&amp;gt;Windows guest drivers&amp;lt;/b&amp;gt; are available [http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers here]&lt;br /&gt;
&lt;br /&gt;
Please try your distro&#039;s packages first. Normally you do not need to patch anything or build from source.&lt;br /&gt;
&lt;br /&gt;
===Getting old versions of KVM===&lt;br /&gt;
&lt;br /&gt;
If you want to use specific versions of KVM kernel modules and supporting userspace, you can download the latest version from [http://sourceforge.net/project/showfiles.php?group_id=180599 http://sourceforge.net/project/showfiles.php?group_id=180599].  Note that as of QEMU 1.3, the userspace code comes straight from [http://wiki.qemu.org/Download http://wiki.qemu.org/Download].&lt;br /&gt;
&lt;br /&gt;
For the userspace components, you will find both qemu-kvm-&amp;lt;version&amp;gt; and kvm-&amp;lt;version&amp;gt; there.&lt;br /&gt;
qemu-kvm is the stable branch of kvm; it is based on qemu&#039;s point releases with the kvm extras on top. kvm-NN releases were previously known as the development releases, but are deprecated today and should not be used.&lt;br /&gt;
&lt;br /&gt;
The kernel modules can be found in kvm-kmod-&amp;lt;kernel version&amp;gt;. A kernel version of 2.6.32.3 means that these are the &#039;&#039;same&#039;&#039; modules as those included with the 2.6.32.3 kernel from http://www.kernel.org&lt;br /&gt;
&lt;br /&gt;
For a list of changes in each release, consult the changelog files included in the Sourceforge download directory alongside each qemu-kvm and kvm-kmod release.&lt;br /&gt;
&lt;br /&gt;
If you use a kernel from http://www.kernel.org or one provided from your distribution and &#039;&#039;&#039;do not&#039;&#039;&#039; use the modules provided by kvm-kmod releases:&lt;br /&gt;
* your kernel has to be 2.6.29 or newer to run any version of qemu-kvm (kernel 2.6.27/2.6.28 with kvm-kmod 2.6.29 will also work)&lt;br /&gt;
* your kernel has to be 2.6.25 or newer to run the kvm 76 userspace (or any newer kvm-XX release)&lt;br /&gt;
* the modules provided by Linux 2.6.22 or later require kvm-22 or any later version.  Some features are available only with newer kernels or userspace.  It is recommended to use the latest available version.&lt;br /&gt;
* the modules provided by Linux 2.6.21 require &#039;&#039;&#039;[http://downloads.sourceforge.net/kvm/kvm-17.tar.gz kvm-17]&#039;&#039;&#039;.  If you use the external module, use the latest available version.&lt;br /&gt;
* the modules provided by Linux 2.6.20 require &#039;&#039;&#039;[http://downloads.sourceforge.net/kvm/kvm-12.tar.gz kvm-12]&#039;&#039;&#039;.  If you use the external module, use the latest available version.&lt;br /&gt;
&lt;br /&gt;
Refer to [[choose the right kvm &amp;amp; kernel version]] for more information.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Downloads&amp;diff=4689</id>
		<title>Downloads</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Downloads&amp;diff=4689"/>
		<updated>2013-04-22T12:15:26Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Downloads=&lt;br /&gt;
&lt;br /&gt;
Most Linux distros already have KVM kernel modules and userspace tools available through their packaging systems. This is the easiest and recommended way of using KVM.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;b&amp;gt;KVM kernel modules&amp;lt;/b&amp;gt; are part of the Linux kernel package&lt;br /&gt;
* &amp;lt;b&amp;gt;Userspace tools&amp;lt;/b&amp;gt; are usually called &amp;quot;qemu-kvm&amp;quot; or &amp;quot;kvm&amp;quot;&lt;br /&gt;
* &amp;lt;b&amp;gt;Linux guest drivers&amp;lt;/b&amp;gt; are part of the Linux kernel package&lt;br /&gt;
* &amp;lt;b&amp;gt;Windows guest drivers&amp;lt;/b&amp;gt; are available [http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers here]&lt;br /&gt;
&lt;br /&gt;
===Getting old versions of KVM===&lt;br /&gt;
&lt;br /&gt;
If you want to use specific versions of KVM kernel modules and supporting userspace, you can download the latest version from [http://sourceforge.net/project/showfiles.php?group_id=180599 http://sourceforge.net/project/showfiles.php?group_id=180599]. &lt;br /&gt;
&lt;br /&gt;
For the userspace components, you will find both qemu-kvm-&amp;lt;version&amp;gt; and kvm-&amp;lt;version&amp;gt; there.&lt;br /&gt;
qemu-kvm is the stable branch of kvm; it is based on qemu&#039;s point releases with the kvm extras on top. kvm-NN releases were previously known as the development releases, but are deprecated today and should not be used.&lt;br /&gt;
&lt;br /&gt;
The kernel modules can be found in kvm-kmod-&amp;lt;kernel version&amp;gt;. A kernel version of 2.6.32.3 means that these are the &#039;&#039;same&#039;&#039; modules as those included with the 2.6.32.3 kernel from http://www.kernel.org&lt;br /&gt;
&lt;br /&gt;
For a list of changes in each release, consult the changelog files included in the Sourceforge download directory alongside each qemu-kvm and kvm-kmod release.&lt;br /&gt;
&lt;br /&gt;
If you use a kernel from http://www.kernel.org or one provided from your distribution and &#039;&#039;&#039;do not&#039;&#039;&#039; use the modules provided by kvm-kmod releases:&lt;br /&gt;
* your kernel has to be 2.6.29 or newer to run any version of qemu-kvm (kernel 2.6.27/2.6.28 with kvm-kmod 2.6.29 will also work)&lt;br /&gt;
* your kernel has to be 2.6.25 or newer to run the kvm 76 userspace (or any newer kvm-XX release)&lt;br /&gt;
* the modules provided by Linux 2.6.22 or later require kvm-22 or any later version.  Some features are available only with newer kernels or userspace.  It is recommended to use the latest available version.&lt;br /&gt;
* the modules provided by Linux 2.6.21 require &#039;&#039;&#039;[http://downloads.sourceforge.net/kvm/kvm-17.tar.gz kvm-17]&#039;&#039;&#039;.  If you use the external module, use the latest available version.&lt;br /&gt;
* the modules provided by Linux 2.6.20 require &#039;&#039;&#039;[http://downloads.sourceforge.net/kvm/kvm-12.tar.gz kvm-12]&#039;&#039;&#039;.  If you use the external module, use the latest available version.&lt;br /&gt;
&lt;br /&gt;
Refer to [[choose the right kvm &amp;amp; kernel version]] for more information.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Main_Page&amp;diff=4688</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Main_Page&amp;diff=4688"/>
		<updated>2013-04-22T12:06:59Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Kernel Based Virtual Machine ==&lt;br /&gt;
&lt;br /&gt;
KVM (for Kernel-based Virtual Machine) is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel VT or AMD-V). It consists of a loadable kernel module, kvm.ko, that provides the core virtualization infrastructure, and a processor-specific module, kvm-intel.ko or kvm-amd.ko.&lt;br /&gt;
&lt;br /&gt;
Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.&lt;br /&gt;
&lt;br /&gt;
The kernel component of KVM is included in mainline Linux, as of 2.6.20.&lt;br /&gt;
&lt;br /&gt;
The userspace component of KVM is included in mainline QEMU, as of 1.3.&lt;br /&gt;
&lt;br /&gt;
KVM is open source software.&lt;br /&gt;
&lt;br /&gt;
== Common Pages ==&lt;br /&gt;
{| style=&amp;quot;border:none&amp;quot;&lt;br /&gt;
|style=&amp;quot;width:50%;border:none;&amp;quot;|&lt;br /&gt;
* [[KVM Forum]]&lt;br /&gt;
** [[KVM Forum 2012]]&lt;br /&gt;
** [[KVM Forum 2011]]&lt;br /&gt;
** [[KVM Forum 2010]]&lt;br /&gt;
** [[KVM Forum 2008]]&lt;br /&gt;
** [[KVM Forum 2007]]&lt;br /&gt;
* Linux Plumbers Conference&lt;br /&gt;
** [[LinuxPlumbers2010|LPC 2010]]&lt;br /&gt;
** [[LPC 2012]]&lt;br /&gt;
* [[TODO]]&lt;br /&gt;
* [[KVM-Autotest]]&lt;br /&gt;
* [[KVM Features]]&lt;br /&gt;
|style=&amp;quot;width:50%;border:none;&amp;quot;|&lt;br /&gt;
* [[Management Tools]]&lt;br /&gt;
* [[Documents]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Common External Pages ==&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;border:none&amp;quot;&lt;br /&gt;
|style=&amp;quot;width:50%;border:none;&amp;quot;|&lt;br /&gt;
* [http://www.qemu.org/ QEMU]&lt;br /&gt;
* [http://wiki.xensource.com/xenwiki/HVM_Compatible_Processors Xen&#039;s HVM Compatible Processors List]&lt;br /&gt;
|style=&amp;quot;width:50%;border:none;&amp;quot;|&lt;br /&gt;
* [http://qemu-buch.de Book &amp;quot;qemu-kvm &amp;amp; libvirt&amp;quot;]&lt;br /&gt;
* [http://qemu-buch.de/cgi-bin/moin.cgi/ QEMU Wiki]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=KVM_Forum_2011&amp;diff=3766</id>
		<title>KVM Forum 2011</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=KVM_Forum_2011&amp;diff=3766"/>
		<updated>2011-08-15T12:49:35Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= KVM Forum 2011 =&lt;br /&gt;
= Vancouver Canada, August 15-16, 2011 =&lt;br /&gt;
The KVM Forum 2011 will be held at the Hyatt Regency Vancouver in Vancouver, Canada on August 15-16, 2011. We will be co-located with LinuxCon North America 2011:&lt;br /&gt;
&lt;br /&gt;
http://events.linuxfoundation.org/events/linuxcon&lt;br /&gt;
&lt;br /&gt;
== Scope ==&lt;br /&gt;
KVM is an industry leading open source hypervisor that provides an ideal&lt;br /&gt;
platform for datacenter virtualization, virtual desktop infrastructure,&lt;br /&gt;
and cloud computing.  Once again, it&#039;s time to bring together the&lt;br /&gt;
community of developers and users that define the KVM ecosystem for&lt;br /&gt;
our annual technical conference.  We will discuss the current state of&lt;br /&gt;
affairs and plan for the future of KVM, its surrounding infrastructure,&lt;br /&gt;
and management tools.  So mark your calendar and join us in advancing KVM.&lt;br /&gt;
&lt;br /&gt;
http://events.linuxfoundation.org/events/kvm-forum/&lt;br /&gt;
&lt;br /&gt;
== CFP ==&lt;br /&gt;
[[KVMForum2011CFP|KVM Forum 2011 CFP]] (now closed, see [[#Schedule|Schedule]])&lt;br /&gt;
&lt;br /&gt;
== Registration ==&lt;br /&gt;
&lt;br /&gt;
Please visit this page to register:&lt;br /&gt;
&lt;br /&gt;
http://events.linuxfoundation.org/events/kvm-forum/register&lt;br /&gt;
&lt;br /&gt;
== Hotel and Travel ==&lt;br /&gt;
The KVM Forum 2011 will be held in Vancouver BC at the Hyatt Regency Vancouver.&lt;br /&gt;
See the Linux Foundation&#039;s KVM Forum page for more details on hotels and travel.&lt;br /&gt;
&lt;br /&gt;
http://events.linuxfoundation.org/events/kvm-forum/travel&lt;br /&gt;
&lt;br /&gt;
== Schedule ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Monday, August 15th&#039;&#039;&#039;&lt;br /&gt;
{|&lt;br /&gt;
! Time !! Title !! Speaker &lt;br /&gt;
|-&lt;br /&gt;
|09:00 - 09:15 || colspan=&amp;quot;2&amp;quot; align=&amp;quot;center&amp;quot;| Welcome&lt;br /&gt;
|-&lt;br /&gt;
|09:15 - 09:30 || Keynote || Avi Kivity&lt;br /&gt;
|-&lt;br /&gt;
|09:30 - 10:00 || KVM on the IBM POWER7 Processor || Paul Mackerras&lt;br /&gt;
|-&lt;br /&gt;
|10:00 - 10:30 || VFIO: PCI device assignment breaks free of KVM || Alex Williamson&lt;br /&gt;
|-&lt;br /&gt;
| 10:30 - 10:45  || colspan=&amp;quot;2&amp;quot; align=&amp;quot;center&amp;quot;| Break&lt;br /&gt;
|-&lt;br /&gt;
| 10:45 - 11:15 || The reinvention of qcow2 || Kevin Wolf&lt;br /&gt;
|-&lt;br /&gt;
| 11:15 - 11:45 || Virtio SCSI: An alternative virtualized storage stack for KVM || Stefan Hajnoczi &amp;amp; Paolo Bonzini&lt;br /&gt;
|-&lt;br /&gt;
| 11:45 - 12:15 || Native Linux KVM tool || Asias He&lt;br /&gt;
|-&lt;br /&gt;
| 12:15 - 13:30 || colspan=&amp;quot;2&amp;quot; align=&amp;quot;center&amp;quot;| Lunch&lt;br /&gt;
|}&lt;br /&gt;
{|&lt;br /&gt;
! !! colspan=&amp;quot;2&amp;quot;|Track 1 !! colspan=&amp;quot;2&amp;quot;|Track 2&lt;br /&gt;
|-&lt;br /&gt;
! Time !! Title !! Speaker !! Title !! Speaker&lt;br /&gt;
|-&lt;br /&gt;
| 13:30 - 14:00 || What&#039;s coming from the MM for KVM || Andrea Arcangeli || KVM on Embedded Power Architecture Platforms || Stuart Yoder&lt;br /&gt;
|-&lt;br /&gt;
| 14:00 - 14:30 || Guest memory resizing - free page hinting &amp;amp; more || Rik van Riel || Using KVM as a Real-Time Hypervisor || Jan Kiszka&lt;br /&gt;
|-&lt;br /&gt;
| 14:30 - 15:00 || Optimizing Your KVM Instances || Mark Wagner || Experiences porting KVM to SmartOS || Bryan Cantrill&lt;br /&gt;
|-&lt;br /&gt;
| 15:00 - 15:20 || colspan=&amp;quot;4&amp;quot; align=&amp;quot;center&amp;quot;|Break&lt;br /&gt;
|-&lt;br /&gt;
|15:20 - 15:50 || virtio networking status update and case study || Michael S. Tsirkin || Introduction to the libvirt APIs for KVM management and their future development || Daniel Berrange&lt;br /&gt;
|-&lt;br /&gt;
|15:50 - 16:20 || IO Throttling in QEMU || Ryan Harper || VDSM is now Free || Dan Kenigsberg&lt;br /&gt;
|-&lt;br /&gt;
|16:20 - 16:50 || ||  || The best of both worlds: Network virtualization and KVM || Yoshi Tamura&lt;br /&gt;
|-&lt;br /&gt;
|16:50 - 17:10 || colspan=&amp;quot;4&amp;quot; align=&amp;quot;center&amp;quot;|Break&lt;br /&gt;
|-&lt;br /&gt;
|17:10 - 19:00 || colspan=&amp;quot;4&amp;quot; align=&amp;quot;center&amp;quot;|[[#BoFs|BoFs]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Tuesday, August 16th&#039;&#039;&#039;&lt;br /&gt;
{|&lt;br /&gt;
! Time !! Title !! Speaker&lt;br /&gt;
|-&lt;br /&gt;
| 9:00 - 9:15 || Keynote || Anthony Liguori&lt;br /&gt;
|-&lt;br /&gt;
| 9:15 - 9:45 || Performance monitoring in KVM guests || Avi Kivity&lt;br /&gt;
|-&lt;br /&gt;
| 9:45 - 10:15 || AHCI - doing storage right || Alexander Graf&lt;br /&gt;
|-&lt;br /&gt;
| 10:15 - 10:45 || Code Generation for Fun and Profit || Anthony Liguori&lt;br /&gt;
|-&lt;br /&gt;
| 10:45 - 11:00 || colspan=&amp;quot;2&amp;quot; align=&amp;quot;center&amp;quot; | Break&lt;br /&gt;
|-&lt;br /&gt;
| 11:00 - 11:30 || QEMU&#039;s device model qdev: Where do we go from here? || Markus Armbruster&lt;br /&gt;
|-&lt;br /&gt;
| 11:30 - 12:00 || SPICE Roadmap || Alon Levy&lt;br /&gt;
|-&lt;br /&gt;
| 12:00 - 12:30 || Fixing the USB disaster || Gerd Hoffmann&lt;br /&gt;
|-&lt;br /&gt;
| 12:30 - 13:45 || colspan=&amp;quot;2&amp;quot; align=&amp;quot;center&amp;quot; | Lunch&lt;br /&gt;
|}&lt;br /&gt;
{|&lt;br /&gt;
! !! colspan=&amp;quot;2&amp;quot;|Track 1 !! colspan=&amp;quot;2&amp;quot;|Track 2&lt;br /&gt;
|-&lt;br /&gt;
! Time !! Title !! Speaker !! Title !! Speaker&lt;br /&gt;
|-&lt;br /&gt;
| 13:45 - 14:15 || KVM Graphics Direct Assignment || Allen Kay || Low-Latency, High-Bandwidth Use Cases for Nahanni/ivshmem || Paul Lu&lt;br /&gt;
|-&lt;br /&gt;
| 14:15 - 14:45 || Making KVM autotest useful for KVM developers || Lucas Meneghel Rodrigues || QEMU: live block copy || Marcelo Tosatti&lt;br /&gt;
|-&lt;br /&gt;
| 14:45 - 15:15 || AMD IOMMU Version 2 Support in KVM || Joerg Roedel || Livebackup - A Complete Solution for making Full and Incremental Disk Backups of Running VMs || Jagane Sundar&lt;br /&gt;
|-&lt;br /&gt;
| 15:15 - 15:30 || colspan=&amp;quot;4&amp;quot; align=&amp;quot;center&amp;quot;|Break&lt;br /&gt;
|-&lt;br /&gt;
| 15:30 - 16:00 || Geographically distributed HPC Clouds using KVM || Conrad Wood || Migration: one year later || Juan Quintela&lt;br /&gt;
|-&lt;br /&gt;
| 16:00 - 16:30 || Implementing a Hardware Appliance Product: Applied usage of qemu/KVM and libvirt || Ricardo Marin Matinata || Rapid VM Synchronization with I/O Emulation Logging-Replay || Kei Ohmura&lt;br /&gt;
|-&lt;br /&gt;
| 16:30 - 17:00 || Improving the Out-of-box Performance When Using KVM ||Andrew Theurer || Enhancing Live Migration Process for CPU and/or memory intensive VMs running Enterprise applications || Benoit Hudzia&lt;br /&gt;
|-&lt;br /&gt;
| 17:00 - 17:30 || ||  || Yabusame: Postcopy Live Migration for Qemu/KVM || Takahiro Hirofuchi&lt;br /&gt;
|-&lt;br /&gt;
| 17:30 - 17:45 || colspan=&amp;quot;4&amp;quot; align=&amp;quot;center&amp;quot;|Closing&lt;br /&gt;
|-&lt;br /&gt;
| 18:00 - 19:00 || colspan=&amp;quot;4&amp;quot; align=&amp;quot;center&amp;quot;|[[#BoFs|BoFs]]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== BoFs ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Date TBD&#039;&#039;&#039;&lt;br /&gt;
{|&lt;br /&gt;
! Title !! Leader&lt;br /&gt;
|-&lt;br /&gt;
| Tracing from the virtual machine || Dhaval Giani&lt;br /&gt;
|-&lt;br /&gt;
| Guest and Host Communication || Amit Shah&lt;br /&gt;
|-&lt;br /&gt;
| 3D emulation and remoting || Alon Levy&lt;br /&gt;
|-&lt;br /&gt;
| Approaches to Cloud Storage || Igor Serebryany&lt;br /&gt;
|-&lt;br /&gt;
| Moving away from C || Avi Kivity&lt;br /&gt;
|-&lt;br /&gt;
| Device Assignment || Alexander Graf&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Monday, August 15th&#039;&#039;&#039;&lt;br /&gt;
{|&lt;br /&gt;
! Title !! Leader&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Tuesday, August 16th&#039;&#039;&#039;&lt;br /&gt;
{|&lt;br /&gt;
! Title !! Leader&lt;br /&gt;
|-&lt;br /&gt;
| USB mini-summit || Sarah Sharp&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=File:2011-forum-virtio-scsi.pdf&amp;diff=3733</id>
		<title>File:2011-forum-virtio-scsi.pdf</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=File:2011-forum-virtio-scsi.pdf&amp;diff=3733"/>
		<updated>2011-08-13T06:20:13Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: Viritio-scsi: An alternative virtualized storage stack for KVM&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Virtio-scsi: An alternative virtualized storage stack for KVM&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Code&amp;diff=3642</id>
		<title>Code</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Code&amp;diff=3642"/>
		<updated>2011-06-23T12:32:05Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Code=&lt;br /&gt;
&lt;br /&gt;
[[Category:Architechture]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== kernel git tree ==&lt;br /&gt;
&lt;br /&gt;
The kvm kernel code is available through a git tree (like the kernel itself).  To create a repository using git, type&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/virt/kvm/kvm.git&lt;br /&gt;
&lt;br /&gt;
Alternatively, it is also accessible through the kernel.org gitweb interface:               &lt;br /&gt;
[http://git.kernel.org/?p=virt/kvm/kvm.git;a=summary]&lt;br /&gt;
&lt;br /&gt;
For subsequent upgrades use the command&lt;br /&gt;
                                       &lt;br /&gt;
 git pull&lt;br /&gt;
&lt;br /&gt;
in the git working directory.&lt;br /&gt;
&lt;br /&gt;
== userspace git tree ==&lt;br /&gt;
&lt;br /&gt;
The kvm userspace code (libkvm and qemu) is available through a git tree. To create a repository using git, type&lt;br /&gt;
                                                                                      &lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git&lt;br /&gt;
&lt;br /&gt;
Alternatively, it is also accessible through the kernel.org gitweb interface:         &lt;br /&gt;
[http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=summary]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;If you want to contribute code&#039;&#039;&#039;, please [http://wiki.qemu.org/Contribute/StartHere develop against qemu.git] and submit patches to qemu-devel@nongnu.org.  The qemu-kvm.git tree regularly gets changes from qemu.git and patches against qemu-kvm.git should be minimized.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== building an external module with older kernels ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This only works for the x86 architecture.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you wish to use a distribution kernel (or just some random kernel you like) with kvm, you can use the external module kit.  You will need the kvm-kmod repository:&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/virt/kvm/kvm-kmod.git&lt;br /&gt;
 cd kvm-kmod&lt;br /&gt;
 git submodule update --init&lt;br /&gt;
 ./configure [--kerneldir=/path/to/kernel/dir]&lt;br /&gt;
 make sync&lt;br /&gt;
 make&lt;br /&gt;
&lt;br /&gt;
=== Tip about building against Red Hat Enterprise Linux kernels ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;kvm-userspace/kernel&amp;lt;/code&amp;gt; has some compat code to allow it to compile against older kernels, and also some code specific to features that are normally not present on older kernels but are present on RHEL kernels.&lt;br /&gt;
&lt;br /&gt;
So, when building against a RHEL kernel tree, check if the &amp;lt;code&amp;gt;RHEL_*&amp;lt;/code&amp;gt; macros at &amp;lt;code&amp;gt;${kerneldir}/include/linux/version.h&amp;lt;/code&amp;gt; are defined correctly, corresponding to the RHEL version where the kernel source comes from. If those macros aren&#039;t defined correctly, the compat code that allows compilation against RHEL kernels will break and you will get build errors.&lt;br /&gt;
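A quick way to sanity-check this before building is to grep the kernel tree for the macros. This is a minimal sketch, not part of the official build steps; the kerneldir path is an assumption, and the exact RHEL_* macro names (e.g. RHEL_MAJOR, RHEL_MINOR) vary by RHEL release, so adjust to your tree:

```shell
# Sanity check before building kvm-kmod against a RHEL kernel tree:
# the RHEL_* compat macros must be defined in version.h.
# (kerneldir and the macro names are assumptions; adjust to your setup.)
kerneldir=${kerneldir:-/usr/src/kernels/$(uname -r)}
if grep -qE 'RHEL_(MAJOR|MINOR|RELEASE)' "${kerneldir}/include/linux/version.h" 2>/dev/null; then
    echo "RHEL_* macros found"
else
    echo "RHEL_* macros missing: expect compat build errors"
fi
```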
&lt;br /&gt;
&lt;br /&gt;
== release tags ==&lt;br /&gt;
&lt;br /&gt;
kvm stable releases (based on Qemu&#039;s stable branch) are tagged with &amp;lt;code&amp;gt;kvm-qemu-0.NN.N&amp;lt;/code&amp;gt;, where &#039;&#039;NN.N&#039;&#039; corresponds to the upstream Qemu version.  Note that kvm tags these releases rather than branching them.&lt;br /&gt;
&lt;br /&gt;
kvm development releases are tagged with &amp;lt;code&amp;gt;kvm-nn&amp;lt;/code&amp;gt; where &#039;&#039;nn&#039;&#039; is the release number.&lt;br /&gt;
&lt;br /&gt;
== Binary Packages ==&lt;br /&gt;
&lt;br /&gt;
=== CentOS / RHEL ===&lt;br /&gt;
&lt;br /&gt;
Unofficial packages of the latest releases can be found at:&lt;br /&gt;
&amp;lt;code&amp;gt;http://www.lfarkas.org/linux/packages/centos/5/&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Debian ===&lt;br /&gt;
&lt;br /&gt;
For Debian Lenny, please use packages from backports.debian.org for &amp;lt;em&amp;gt;both&amp;lt;/em&amp;gt; qemu-kvm and the kernel (at least 2.6.32).  It is important to use a more recent kernel; 2.6.26 does not work well with kvm.&lt;br /&gt;
&lt;br /&gt;
Note that the package &amp;quot;kvm&amp;quot; has been renamed to &amp;quot;qemu-kvm&amp;quot; in Squeeze and in Lenny backports (kvm is now a transitional package that installs qemu-kvm automatically).&lt;br /&gt;
&lt;br /&gt;
Debian Squeeze will have qemu-kvm based on 0.12, available in standard repositories.&lt;br /&gt;
&lt;br /&gt;
Experimental 0.13 packages are available at&lt;br /&gt;
&amp;lt;code&amp;gt;http://www.corpit.ru/debian/tls/kvm/0.13/&amp;lt;/code&amp;gt;, pending upload to Debian experimental.&lt;br /&gt;
&lt;br /&gt;
== nightly snapshots ==&lt;br /&gt;
&lt;br /&gt;
Nightly snapshots, for those who are uncomfortable with git, are [http://people.qumranet.com/avi/snapshots available].  When reporting a problem with a snapshot, please quote the snapshot name (which includes the date) and the contents of the SOURCES file in the snapshot tarball.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3061</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3061"/>
		<updated>2010-07-02T16:02:35Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
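The C program itself is not reproduced on this page. As a rough illustration of the same idea, here is a minimal Python sketch (the function name and structure are mine, not the original benchmark&#039;s). O_DIRECT requires an aligned buffer, which an anonymous mmap provides:

```python
import io
import mmap
import os
import time

def seqread_mean_latency_ns(path, block_size=4096, use_direct=True):
    """Sequentially read a file in block_size chunks and return the mean
    time per read in nanoseconds, the number the guest benchmark reports."""
    flags = os.O_RDONLY
    if use_direct and hasattr(os, "O_DIRECT"):
        # Bypass the page cache, as the C benchmark does.  O_DIRECT also
        # requires block_size to be a multiple of the device sector size.
        flags |= os.O_DIRECT
    f = io.FileIO(os.open(path, flags), "r", closefd=True)
    buf = mmap.mmap(-1, block_size)  # anonymous mapping is page-aligned
    reads = 0
    start = time.monotonic_ns()
    while f.readinto(buf):  # returns 0 at end of file
        reads += 1
    elapsed = time.monotonic_ns() - start
    f.close()
    return elapsed // max(reads, 1)
```

The returned value corresponds to the per-request latency figures discussed in the results section.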
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find the vdev pointer for the correct virtio-blk device in the trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against those from inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
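The subtraction described above can be reproduced in a few lines of Python. This is just a sketch over the cumulative numbers from the first table; the layer names here are the cumulative-table labels, not the delta-table labels (e.g. the delta under &quot;Guest benchmark&quot; is the time the delta table calls &quot;Guest&quot;):

```python
# Cumulative mean latencies (ns) from the table above, outermost layer first.
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
]
total = cumulative[0][1]

# Each layer's own time is the difference between successive readings.
deltas = [(name, outer - inner)
          for (name, outer), (_, inner) in zip(cumulative, cumulative[1:])]
# Everything below QEMU paio is attributed to host I/O.
deltas.append(("Host I/O", cumulative[-1][1]))

for name, delta in deltas:
    print(f"{name}: {delta} ns ({100.0 * delta / total:.2f}%)")
```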
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection, which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static trace points to QEMU, is a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
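Each event in the table corresponds to a declaration in the trace-events file, using a name(args) plus format-string style.  The argument lists below are illustrative sketches of that declaration format, not copied from the actual patches:

```
virtio_queue_notify(void *vdev, int n) "vdev %p n %d"
virtio_notify(void *vdev, void *vq) "vdev %p vq %p"
laio_submit(void *ctx, void *iocb, int64_t sector_num) "ctx %p iocb %p sector_num %ld"
```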
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
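As a quick consistency check, the gap between the two AIO mechanisms in the submission-to-completion phase roughly matches the end-to-end gap reported by the guest benchmark.  All figures come from the tables above; the arithmetic is just a sanity check, not new data:

```python
# End-to-end gap reported by the seqread benchmark (us).
end_to_end_gap = 200.309 - 193.374  # aio=threads minus aio=native

# Gap in the submit-to-completion phase from the trace (us).
trace_gap = 143.724 - 136.241

print("end-to-end gap: %.3f us, trace gap: %.3f us" % (end_to_end_gap, trace_gap))
```

Both gaps are about 7 us, so the AIO submit-to-completion phase accounts for essentially all of the difference between the two mechanisms.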
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
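Pairing sys_enter/sys_exit timestamps gives the time spent blocked in each syscall and the userspace time between syscalls.  A minimal sketch of that pairing over (kind, timestamp) tuples; the input format here is an assumption, not the exact ftrace output:

```python
def syscall_times(events):
    """Pair sys_enter/sys_exit events into syscall and userspace durations.

    events is a list of (kind, timestamp_seconds) tuples for one thread,
    where kind is "enter" or "exit".
    """
    syscall = []    # time spent inside each system call
    userspace = []  # time in userspace between an exit and the next enter
    last_enter = None
    last_exit = None
    for kind, ts in events:
        if kind == "enter":
            if last_exit is not None:
                userspace.append(ts - last_exit)
            last_enter = ts
        else:
            if last_enter is not None:
                syscall.append(ts - last_enter)
            last_exit = ts
    return syscall, userspace
```

The mean, standard deviation, minimum, maximum, and total in the tables below are then ordinary statistics over these duration lists.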
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the system time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd() could be modified to reduce latency in the case where a single AIO request has completed.  The read() = -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread select(2) call should detect that more AIO events have completed since the file descriptor is still readable next time around the main loop.  This increases latency when AIO requests complete while still in qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread return from select(2) is high.  There is no complicated code in the path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post only takes 1 us, not 8 us as suggested by the mean.&lt;br /&gt;
&lt;br /&gt;
=== Read request lifecycle ===&lt;br /&gt;
&lt;br /&gt;
The following data shows the code path executed in QEMU when the seqread benchmark runs inside the guest:&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Time since previous event (us)&lt;br /&gt;
!Thread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|0.265&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|0.422&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|35.678&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|0.694&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|2.560&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|1.012&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|16.313&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|0.923&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|0.273&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|118.307&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|0.410&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|1.624&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.318&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.282&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|0.269&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|3.626&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|1.337&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|0.741&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|0.211&lt;br /&gt;
|iothread&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Measure&lt;br /&gt;
!Time (us)&lt;br /&gt;
|-&lt;br /&gt;
|Virtqueue notify to completion interrupt time [aio=native]&lt;br /&gt;
|147.611&lt;br /&gt;
|-&lt;br /&gt;
|Virtqueue notify to completion interrupt time [aio=threads, old QEMU binary]&lt;br /&gt;
|159.628&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|seqread latency figure from guest&lt;br /&gt;
|190.229&lt;br /&gt;
|-&lt;br /&gt;
|seqread latency figure from host&lt;br /&gt;
|128.862&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Observations:&lt;br /&gt;
* virtqueue_pop 2.560 us is expensive, probably due to vring accesses.  A RAM API would make this faster since the vring could be permanently mapped.&lt;br /&gt;
* Overhead at QEMU level is still 147.611 / 128.862 = 14.5%.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3060</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3060"/>
		<updated>2010-07-02T15:52:22Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
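The original benchmark is a C program, but the same idea can be sketched in Python.  This is a hedged illustration, not the actual benchmark: the end-of-file wrap-around is my own assumption, and O_DIRECT needs an aligned buffer, which mmap memory provides (note that tmpfs does not support O_DIRECT):

```python
import mmap
import os
import time

def seqread_mean_latency(path, block_size=4096, count=1000, direct=True):
    """Sequential block_size reads; returns mean seconds per read call."""
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # bypass the page cache (Linux only)
    fd = os.open(path, flags)
    # O_DIRECT requires an aligned buffer; mmap memory is page-aligned.
    buf = mmap.mmap(-1, block_size)
    start = time.monotonic()
    done = 0
    for _ in range(count):
        n = os.readv(fd, [buf])
        if n == 0:  # end of file: wrap around and keep reading sequentially
            os.lseek(fd, 0, os.SEEK_SET)
            continue
        done += 1
    elapsed = time.monotonic() - start
    os.close(fd)
    return elapsed / max(done, 1)
```

Run against the raw test LUN, this reports the same kind of mean per-request latency figure that the guest and host measurements below are based on.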
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
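The statistics qemu_paio.py computes can be sketched as follows, over (submit, complete) timestamp pairs extracted from the pretty-printed trace.  The pairing of submit and completion records is assumed here; this is an illustration of the calculation, not the script itself:

```python
def latency_stats(pairs):
    """Mean/min/max request latency from (submit_ts, complete_ts) pairs.

    Timestamps are in nanoseconds, as in the simple trace backend output.
    """
    lat = [complete - submit for submit, complete in pairs]
    n = len(lat)
    mean = sum(lat) / float(n)
    return {"count": n, "mean": mean, "min": min(lat), "max": max(lat)}
```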
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection, which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
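&lt;br /&gt;
The raw_syscalls approach pairs each sys_enter timestamp with the matching sys_exit for the same thread and summarizes the resulting durations.  A minimal sketch of the summary step, producing the count/mean/std deviation/min/max/total columns used in the statistics tables below (the actual analysis script is not part of this page):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: summarize syscall durations (in seconds) into the columns used
# in the latency statistics tables: count, mean, std deviation, min, max,
# total.  The durations themselves come from pairing sys_enter/sys_exit
# timestamps per thread; that pairing step is assumed here.
import math

def summarize(durations):
    n = len(durations)
    total = sum(durations)
    mean = total / n
    variance = sum((d - mean) ** 2 for d in durations) / n
    return {
        "count": n,
        "mean": mean,
        "std": math.sqrt(variance),
        "min": min(durations),
        "max": max(durations),
        "total": total,
    }
```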
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the system time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd handling could be modified to reduce latency in the case where a single AIO request has completed.  The read() = -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread select(2) call should detect that more AIO events have completed since the file descriptor is still readable next time around the main loop.  This increases latency when AIO requests complete while still in qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread return from select(2) is high.  There is no complicated code in this path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post takes only 1 us, not the 8 us suggested by the mean.&lt;br /&gt;
&lt;br /&gt;
=== Read request lifecycle ===&lt;br /&gt;
&lt;br /&gt;
The following data shows the code path executed in QEMU when the seqread benchmark runs inside the guest:&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Time since previous event (us)&lt;br /&gt;
!Thread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|0.265&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|0.422&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|35.678&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|0.694&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|2.560&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|1.012&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|16.313&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|0.923&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|0.273&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|118.307&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|0.410&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|1.624&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.318&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.282&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|0.269&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|3.626&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|1.337&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|0.741&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|0.211&lt;br /&gt;
|iothread&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Measure&lt;br /&gt;
!Time (us)&lt;br /&gt;
|-&lt;br /&gt;
|Completion interrupt to next virtqueue notify time&lt;br /&gt;
|38.654&lt;br /&gt;
|-&lt;br /&gt;
|Virtqueue notify to completion interrupt time&lt;br /&gt;
|147.611&lt;br /&gt;
|-&lt;br /&gt;
|Total&lt;br /&gt;
|186.265&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|seqread latency figure from guest&lt;br /&gt;
|190.229&lt;br /&gt;
|-&lt;br /&gt;
|seqread latency figure from host&lt;br /&gt;
|128.862&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3059</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3059"/>
		<updated>2010-07-02T15:50:54Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
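&lt;br /&gt;
For illustration, here is the shape of such a benchmark as a Python sketch (the actual benchmark is a C program; this version is an approximation only):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of a seqread-style benchmark: sequential 4k reads with the mean
# per-request latency computed at the end.  The real benchmark is a C
# program; this Python version is an illustration only.  O_DIRECT needs a
# block-aligned buffer, so an anonymous mmap (page-aligned) is used.
import mmap
import os
import time

def seqread_mean_latency_ns(path, block_size=4096, use_direct=True):
    flags = os.O_RDONLY
    if use_direct and hasattr(os, "O_DIRECT"):
        flags |= os.O_DIRECT
    fd = os.open(path, flags)
    buf = mmap.mmap(-1, block_size)  # page-aligned read buffer
    nreads = 0
    start = time.monotonic_ns()
    while os.readv(fd, [buf]) == block_size:
        nreads += 1
    total_ns = time.monotonic_ns() - start
    os.close(fd)
    return total_ns / nreads if nreads else 0.0
```

Pass use_direct=False on filesystems without O_DIRECT support.&lt;br /&gt;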
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
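&lt;br /&gt;
The notify-to-interrupt latency can then be computed by pairing each &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; event with the following &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; event.  A sketch of that pairing, assuming a simplified trace line format of a timestamp followed by the event name (real ftrace output has more fields):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: pair each kvm_pio (virtqueue notify) event with the next
# kvm_set_irq (guest interrupt) event and report the latency in between.
# Assumes simplified lines of the form "<timestamp-in-seconds> <event>".
def notify_to_irq_latencies(lines):
    latencies = []
    pio_ts = None
    for line in lines:
        ts_str, event = line.split(None, 1)
        ts = float(ts_str)
        if event.startswith("kvm_pio"):
            pio_ts = ts
        elif event.startswith("kvm_set_irq") and pio_ts is not None:
            latencies.append(ts - pio_ts)
            pio_ts = None
    return latencies
```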
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against those when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
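&lt;br /&gt;
The same subtraction can be written out as a short script reproducing the delta table from the cumulative readings above:&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: reproduce the per-layer delta table from the cumulative latency
# readings above (values in ns).  Each delta is the difference between two
# successive cumulative readings; the bottom reading, QEMU paio, stands
# for the Host I/O time itself.
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
]
layer_names = ["Guest", "Host/guest switching", "Host/QEMU switching", "QEMU"]

deltas = [
    (name, upper[1] - lower[1])
    for name, upper, lower in zip(layer_names, cumulative, cumulative[1:])
]
host_io = ("Host I/O", cumulative[-1][1])
total = cumulative[0][1]

for name, ns in deltas + [host_io]:
    print(f"{name}: {ns} ns ({ns / total:.2%})")
```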
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much other interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
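As a sketch of that approach, the following hypothetical Python code pairs sys_enter/sys_exit events per thread to compute per-call durations.  The tuple-based input format is invented for illustration; real ftrace output is text that would need parsing first.&lt;br /&gt;

```python
# Pair sys_enter/sys_exit raw_syscalls events per thread to obtain per-call
# durations.  Input: (tid, kind, syscall_nr, timestamp_in_seconds) tuples.
def syscall_durations(events):
    pending = {}    # per-tid (syscall_nr, enter timestamp)
    durations = []
    for tid, kind, nr, ts in events:
        if kind == 'enter':
            pending[tid] = (nr, ts)
        elif tid in pending:
            enter_nr, enter_ts = pending.pop(tid)
            if enter_nr == nr:
                durations.append((tid, nr, ts - enter_ts))
    return durations

# Hypothetical trace fragment: one read() and one io_submit() by the same
# thread (syscall 209 is io_submit on x86_64).
events = [
    (1234, 'enter', 0, 10.000000),
    (1234, 'exit', 0, 10.000001),
    (1234, 'enter', 209, 10.000005),
    (1234, 'exit', 209, 10.000021),
]
for tid, nr, dur in syscall_durations(events):
    print('tid %d syscall %d took %.6f s' % (tid, nr, dur))
```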
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the amount of CPU time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd handling could be modified to reduce latency in the case where only a single AIO request has completed.  The read() returning -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread&#039;s select(2) call would still detect that more AIO events have completed, since the file descriptor remains readable the next time around the main loop.  The trade-off is increased latency when additional AIO requests complete while still inside qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread&#039;s return from select(2) is high.  There is no complicated code in this path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post only takes 1 us, not the 8 us suggested by the mean.&lt;br /&gt;
&lt;br /&gt;
=== Read request lifecycle ===&lt;br /&gt;
&lt;br /&gt;
The following data shows the code path executed in QEMU when the seqread benchmark runs inside the guest:&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Time since previous event (us)&lt;br /&gt;
!Thread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|0.265&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|0.422&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|35.678&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|0.694&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|2.560&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|1.012&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|16.313&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|0.923&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|0.273&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|118.307&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|0.410&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|1.624&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.318&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.282&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|0.269&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|3.626&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|1.337&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|0.741&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|0.211&lt;br /&gt;
|iothread&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|Completion interrupt to next virtqueue notify time&lt;br /&gt;
|38.654 us&lt;br /&gt;
|-&lt;br /&gt;
|Virtqueue notify to completion interrupt time&lt;br /&gt;
|147.611 us&lt;br /&gt;
|-&lt;br /&gt;
|Total&lt;br /&gt;
|186.265 us&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|seqread latency figure from guest&lt;br /&gt;
|190.229 us&lt;br /&gt;
|-&lt;br /&gt;
|seqread latency figure from host&lt;br /&gt;
|128.862 us&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3058</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3058"/>
		<updated>2010-07-02T15:46:58Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
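The timing logic of that benchmark can be sketched in a few lines of Python.  Note this sketch reads a buffered temporary file rather than an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; disk, so it only illustrates how the mean latency is computed, not realistic latencies:&lt;br /&gt;

```python
# Time a loop of sequential 4k reads and report the mean latency per call.
import os
import tempfile
import time

BLOCK = 4096
NBLOCKS = 256

with tempfile.TemporaryFile() as f:
    f.write(b'x' * BLOCK * NBLOCKS)
    f.flush()
    start = time.monotonic()
    for i in range(NBLOCKS):
        data = os.pread(f.fileno(), BLOCK, i * BLOCK)
        assert len(data) == BLOCK
    elapsed = time.monotonic() - start

print('mean latency: %.3f us per read' % (elapsed / NBLOCKS * 1e6))
```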
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids generating disk I/O just to store the trace itself.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find the vdev pointer for the correct virtio-blk device in the trace (this should be easy because most requests go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
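A rough Python equivalent of the qemu_virtio.awk step might look as follows.  The line format assumed here (&#039;event_name timestamp_us vdev=0x...&#039;) is a simplification of the simpletrace.py pretty-printed output and would need adjusting to the real format:&lt;br /&gt;

```python
# Compute virtqueue-notify-to-interrupt latency for a single vdev from
# pretty-printed trace lines.
def virtio_latencies(lines, vdev):
    tag = 'vdev=%s' % vdev
    notify_ts = None
    for line in lines:
        fields = line.split()
        name, ts = fields[0], float(fields[1])
        if tag not in fields:
            continue
        if name == 'virtio_queue_notify':
            notify_ts = ts
        elif name == 'virtio_notify' and notify_ts is not None:
            yield ts - notify_ts
            notify_ts = None

# Invented sample trace lines for illustration.
trace = [
    'virtio_queue_notify 100.0 vdev=0x1234',
    'virtio_notify 260.5 vdev=0x1234',
]
for latency in virtio_latencies(trace, '0x1234'):
    print('%.1f us' % latency)
```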
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core processors, 8 GB RAM, and 12 FC LUNs striped together with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
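The subtraction is simple enough to sketch; the readings below are copied from the cumulative latency table above:&lt;br /&gt;

```python
# Delta between successive cumulative readings; reproduces the table rows.
deltas = [
    ('Guest', 196528 - 170829),
    ('Host/guest switching', 170829 - 163268),
    ('Host/QEMU switching', 163268 - 159628),
    ('QEMU', 159628 - 130235),
    ('Host I/O', 130235),       # innermost reading, taken as-is
]
total = 196528                  # guest benchmark latency
for layer, ns in deltas:
    print('%-22s %6d ns  %5.2f%%' % (layer, ns, 100.0 * ns / total))
```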
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; layer&#039;s 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency is consistent with the host latency.  Ideally, guest tracing could be performed using host timestamps so that guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time between virtqueue notify until issuing the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall and return of the syscall until raising an interrupt to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports 200.309 us latency with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
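The statistics in these tables can be computed from per-call durations as in the following sketch.  The sample input is hypothetical; the real input would come from the ftrace raw_syscalls data:&lt;br /&gt;

```python
# Aggregate (event_name, duration_seconds) samples into the
# count/mean/stddev/min/max/total rows shown in the tables above.
import statistics
from collections import defaultdict

def summarize(samples):
    by_event = defaultdict(list)
    for name, duration in samples:
        by_event[name].append(duration)
    rows = {}
    for name, ds in by_event.items():
        rows[name] = (len(ds), statistics.mean(ds), statistics.pstdev(ds),
                      min(ds), max(ds), sum(ds))
    return rows

samples = [('select()', 0.000271), ('select()', 0.000300), ('read()', 0.000001)]
for name, (n, mean, sd, lo, hi, total) in sorted(summarize(samples).items()):
    print('%-10s n=%d mean=%.6f sd=%.6f min=%.6f max=%.6f total=%.6f'
          % (name, n, mean, sd, lo, hi, total))
```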
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the amount of CPU time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd handling could be modified to reduce latency in the case where only a single AIO request has completed.  The read() returning -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread&#039;s select(2) call would still detect that more AIO events have completed, since the file descriptor remains readable the next time around the main loop.  The trade-off is increased latency when additional AIO requests complete while still inside qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread&#039;s return from select(2) is high.  There is no complicated code in this path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post only takes 1 us, not the 8 us suggested by the mean.&lt;br /&gt;
&lt;br /&gt;
=== Read request lifecycle ===&lt;br /&gt;
&lt;br /&gt;
The following data shows the code path executed in QEMU when the seqread benchmark runs inside the guest:&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Time since previous event (us)&lt;br /&gt;
!Thread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|0.265&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|0.422&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|35.678&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|0.694&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|2.560&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|1.012&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|16.313&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|0.923&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|0.273&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|118.307&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|0.410&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|1.624&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.318&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.282&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|0.269&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|3.626&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|1.337&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|0.741&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|0.211&lt;br /&gt;
|iothread&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
 Completion interrupt to next virtqueue notify time: 38.654 us&lt;br /&gt;
 Virtqueue notify to completion interrupt time: 147.611 us&lt;br /&gt;
 Total: 186.265 us&lt;br /&gt;
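These totals are just sums over the per-event deltas in the table above, split at virtio_queue_notify and after kvm_set_irq_level_done (the table wraps around one full request cycle); a quick sanity check:&lt;br /&gt;

```python
# Per-event deltas (us) copied from the table above, grouped into the
# two phases reported in the totals.
notify_to_interrupt = [0.694, 2.560, 1.012, 16.313, 0.923, 0.273,
                       118.307, 0.410, 1.624, 0.318, 1.282, 0.269, 3.626]
interrupt_to_notify = [1.337, 0.741, 0.211, 0.265, 0.422, 35.678]

phase1 = round(sum(notify_to_interrupt), 3)   # 147.611 us
phase2 = round(sum(interrupt_to_notify), 3)   # 38.654 us
total = round(phase1 + phase2, 3)             # 186.265 us
```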
&lt;br /&gt;
 seqread latency figure from guest: 190.229 us&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3057</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3057"/>
		<updated>2010-07-02T15:30:35Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
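The mean figure is derived with a calculation along these lines (a sketch only; the helper name is made up and the numbers are illustrative):&lt;br /&gt;

```python
# Sketch of how the benchmark computes its result: one timestamp before
# the read loop, one after, divided by the number of read() calls.
def mean_latency_us(total_elapsed_ns, num_reads):
    """Mean time per operation in microseconds."""
    return total_elapsed_ns / num_reads / 1000.0

# e.g. 204800 reads taking 196528 ns each on average:
print(mean_latency_us(196528 * 204800, 204800))   # 196.528
```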
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
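Without the awk script to hand, the pairing it performs can be sketched in Python as follows (the (name, timestamp, vdev) record layout is an assumption about the pretty-printed trace, not the actual simpletrace format):&lt;br /&gt;

```python
def notify_latencies(events, vdev):
    """Pair each virtio_queue_notify with the following virtio_notify
    for one device and return the latencies in nanoseconds."""
    latencies = []
    start = None
    for name, timestamp_ns, dev in events:
        if dev != vdev:
            continue
        if name == 'virtio_queue_notify':
            start = timestamp_ns
        elif name == 'virtio_notify' and start is not None:
            latencies.append(timestamp_ns - start)
            start = None
    return latencies
```

Requests are assumed to be serviced one at a time, which holds for the single-threaded, 1-vcpu benchmark setup above.&lt;br /&gt;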
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
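The statistics themselves are straightforward; here is a standard-library sketch of what a script like qemu_paio.py might compute (this is not the actual script):&lt;br /&gt;

```python
import statistics

def latency_stats(samples_us):
    """Summarize request latencies the way the result tables on this
    page do: count, mean, standard deviation, min, max, and total."""
    return {
        'count': len(samples_us),
        'mean': statistics.mean(samples_us),
        'stdev': statistics.pstdev(samples_us),
        'min': min(samples_us),
        'max': max(samples_us),
        'total': sum(samples_us),
    }
```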
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
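The subtraction can be checked mechanically; this sketch reproduces the &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column from the cumulative readings above:&lt;br /&gt;

```python
def layer_deltas(cumulative_ns):
    """Successive differences of cumulative latency readings; the final
    reading (QEMU paio) is itself the innermost host I/O layer."""
    deltas = [a - b for a, b in zip(cumulative_ns, cumulative_ns[1:])]
    deltas.append(cumulative_ns[-1])
    return deltas

# Guest benchmark, guest virtio-pci, host kvm.ko, QEMU virtio, QEMU paio:
readings = [196528, 170829, 163268, 159628, 130235]
# layer_deltas(readings) yields 25699, 7561, 3640, 29393, 130235
```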
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static trace points to QEMU, provides a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports 200.309 us latency with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
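A sketch of that pairing logic, assuming the trace has already been parsed into (tid, event, syscall_nr, timestamp) tuples (the record layout is an assumption, not the raw ftrace format):&lt;br /&gt;

```python
def syscall_times(records):
    """Pair raw_syscalls sys_enter/sys_exit records per thread and
    return (tid, syscall_nr, duration) tuples in completion order."""
    pending = {}       # (tid, nr) of each syscall currently in flight
    durations = []
    for tid, event, nr, ts in records:
        if event == 'sys_enter':
            pending[(tid, nr)] = ts
        elif event == 'sys_exit' and (tid, nr) in pending:
            durations.append((tid, nr, ts - pending.pop((tid, nr))))
    return durations
```

Feeding the iothread and vcpu thread ids through such a pairing yields per-syscall entry/exit times like those tabulated above.&lt;br /&gt;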
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the system time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd() handling could be modified to reduce latency in the case where only a single AIO request has completed.  The read() returning -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread select(2) call would then detect that more AIO events have completed, since the file descriptor is still readable next time around the main loop.  This only increases latency when AIO requests complete while still in qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread return from select(2) is high.  There is no complicated code in this path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post only takes 1 us, not the 8 us suggested by the mean.&lt;br /&gt;
&lt;br /&gt;
=== Read request lifecycle ===&lt;br /&gt;
&lt;br /&gt;
The following data shows the code path executed in QEMU when the seqread benchmark runs inside the guest:&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Time since previous event (us)&lt;br /&gt;
!Thread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|0.265&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|0.422&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|35.678&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|0.694&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|2.560&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|1.012&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|16.313&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|0.923&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|0.273&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|118.307&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|0.410&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|1.624&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.318&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.282&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|0.269&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|3.626&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|1.337&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|0.741&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|0.211&lt;br /&gt;
|iothread&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3056</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3056"/>
		<updated>2010-07-02T15:27:36Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
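The statistics that qemu_paio.py computes are straightforward once the trace has been reduced to (submit, completion) timestamp pairs.  A hypothetical sketch of that calculation (qemu_paio.py itself is not reproduced here):&lt;br /&gt;

```python
import statistics

def latency_stats(pairs):
    """pairs: (submit_ns, completion_ns) timestamps, one per request."""
    lat = [complete - submit for submit, complete in pairs]
    return {
        'count': len(lat),
        'mean': statistics.mean(lat),
        'stdev': statistics.pstdev(lat),
        'min': min(lat),
        'max': max(lat),
    }
```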
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
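The subtraction is mechanical and can be reproduced from the cumulative table above; a short Python sketch:&lt;br /&gt;

```python
# Cumulative mean latencies (ns) from the table above, outermost layer first.
cumulative = [196528, 170829, 163268, 159628, 130235]
layers = ['Guest', 'Host/guest switching', 'Host/QEMU switching', 'QEMU',
          'Host I/O']

# Each delta is the difference between successive readings; the innermost
# reading (QEMU paio) is itself the Host I/O time.
deltas = [a - b for a, b in zip(cumulative, cumulative[1:])] + [cumulative[-1]]

total = cumulative[0]
for name, ns in zip(layers, deltas):
    print('%-22s %6d ns  %5.2f%%' % (name, ns, 100.0 * ns / total))
```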
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 202095 ns latency (13% of total) is high.  The guest should be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from the syscall&#039;s return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events infrastructure, which adds static trace points to QEMU, is a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
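As a sanity check, and assuming each row in the table above is the delta since the previous trace event, the per-event times should roughly add up to the end-to-end latencies reported by seqread; the small residual gap presumably covers time outside these trace points:&lt;br /&gt;

```python
# Per-event deltas (us) from the table above.
threads = [45.292, 8.023, 143.724, 1.965, 0.260, 1.034]  # aio=threads
native  = [44.464, 8.377, 136.241, 1.754, 0.294, 1.342]  # aio=native

print('aio=threads total: %.3f us (seqread reported 200.309)' % sum(threads))
print('aio=native  total: %.3f us (seqread reported 193.374)' % sum(native))
```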
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
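Pairing raw_syscalls sys_enter/sys_exit events per thread gives both the time spent inside each system call and the *_post userspace time until the next system call.  A rough sketch of the pairing logic, assuming the trace has already been parsed into (tid, timestamp, kind, syscall_nr) tuples:&lt;br /&gt;

```python
from collections import defaultdict

def syscall_stats(events):
    """events: (tid, timestamp_s, 'enter'|'exit', syscall_nr) tuples in
    time order.  Returns per-syscall lists of in-syscall times and
    post-syscall (userspace) times, mirroring the *_post statistics."""
    in_times = defaultdict(list)    # syscall_nr -> [time inside syscall]
    post_times = defaultdict(list)  # syscall_nr -> [userspace time after it]
    last_enter = {}                 # tid -> (nr, ts)
    last_exit = {}                  # tid -> (nr, ts)
    for tid, ts, kind, nr in events:
        if kind == 'enter':
            if tid in last_exit:
                prev_nr, prev_ts = last_exit.pop(tid)
                post_times[prev_nr].append(ts - prev_ts)
            last_enter[tid] = (nr, ts)
        else:
            if tid in last_enter and last_enter[tid][0] == nr:
                _, enter_ts = last_enter.pop(tid)
                in_times[nr].append(ts - enter_ts)
            last_exit[tid] = (nr, ts)
    return in_times, post_times
```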
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the system time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd() could be modified to reduce latency in the case where a single AIO request has completed.  The read() = -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread select(2) call should detect that more AIO events have completed since the file descriptor is still readable next time around the main loop.  This increases latency when AIO requests complete while still in qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread return from select(2) is high.  There is no complicated code in that path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post only takes 1 us, not 8 us as suggested by the mean.&lt;br /&gt;
&lt;br /&gt;
=== Read request lifecycle ===&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Time since previous event (us)&lt;br /&gt;
!Thread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|0.265&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|0.422&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|35.678&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|0.694&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|2.560&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|1.012&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|16.313&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|0.923&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|0.273&lt;br /&gt;
|vcpu&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|118.307&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|0.410&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|1.624&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.318&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.282&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|0.269&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|3.626&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|1.337&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|0.741&lt;br /&gt;
|iothread&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|0.211&lt;br /&gt;
|iothread&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3055</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3055"/>
		<updated>2010-07-02T15:22:21Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
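Once &amp;lt;tt&amp;gt;/tmp/trace&amp;lt;/tt&amp;gt; has been collected, the pio-to-interrupt latency can be extracted by pairing each &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; event with the following &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; event.  A minimal Python sketch (the exact &amp;lt;tt&amp;gt;trace_pipe&amp;lt;/tt&amp;gt; line format varies between kernel versions, so the regular expression below is an assumption and may need adjusting):&lt;br /&gt;

```python
import re

# trace_pipe lines typically look something like this (format varies by
# kernel version -- adjust the regex as needed):
#   qemu-kvm-2017  [001]  1234.567890: kvm_pio: pio_write at 0xc090 size 1 count 1
#   qemu-kvm-2017  [001]  1234.567950: kvm_set_irq: gsi 26 level 1 source 0
TRACE_RE = re.compile(r'\s(\d+\.\d+): (kvm_pio|kvm_set_irq):')

def kvm_latencies(lines):
    """Pair each kvm_pio event with the next kvm_set_irq event and
    return the resulting latencies in nanoseconds."""
    latencies = []
    pio_ts = None
    for line in lines:
        m = TRACE_RE.search(line)
        if m is None:
            continue
        ts, event = float(m.group(1)), m.group(2)
        if event == 'kvm_pio':
            pio_ts = ts                      # virtqueue notify pio exit
        elif pio_ts is not None:             # kvm_set_irq: interrupt set
            latencies.append((ts - pio_ts) * 1e9)
            pio_ts = None
    return latencies
```

With the trace collected as above, &amp;lt;tt&amp;gt;kvm_latencies(open('/tmp/trace'))&amp;lt;/tt&amp;gt; yields one latency sample per request.&lt;br /&gt;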
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
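The subtraction is mechanical and can be reproduced from the cumulative table above; a short Python sketch:&lt;br /&gt;

```python
# Cumulative mean latencies (ns) from the table above, outermost layer first.
cumulative = [196528, 170829, 163268, 159628, 130235]
layers = ['Guest', 'Host/guest switching', 'Host/QEMU switching', 'QEMU',
          'Host I/O']

# Each delta is the difference between successive readings; the innermost
# reading (QEMU paio) is itself the Host I/O time.
deltas = [a - b for a, b in zip(cumulative, cumulative[1:])] + [cumulative[-1]]

total = cumulative[0]
for name, ns in zip(layers, deltas):
    print('%-22s %6d ns  %5.2f%%' % (name, ns, 100.0 * ns / total))
```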
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 202095 ns latency (13% of total) is high.  The guest should be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used, because the guest latency figures are consistent with the host figures.  Ideally, guest tracing would use host timestamps so guest and host measurements could be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from the syscall&#039;s return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
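&lt;br /&gt;
The core of the wait-time calculation can be sketched as follows (a hypothetical Python reimplementation for illustration, not the &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; from the branch; it assumes the ftrace output has already been parsed into tuples):&lt;br /&gt;
&lt;br /&gt;
 def futex_wait_times(events):&lt;br /&gt;
     # events: (pid, kind, timestamp_us) tuples in trace order,&lt;br /&gt;
     # where kind is &#039;enter&#039; or &#039;exit&#039; for the futex syscall.&lt;br /&gt;
     pending, waits = {}, []&lt;br /&gt;
     for pid, kind, ts in events:&lt;br /&gt;
         if kind == &#039;enter&#039;:&lt;br /&gt;
             pending[pid] = ts&lt;br /&gt;
         elif pid in pending:&lt;br /&gt;
             waits.append(ts - pending.pop(pid))&lt;br /&gt;
     return waits&lt;br /&gt;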
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request was 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, provide a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All times are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports a mean latency of 200.309 us with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
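&lt;br /&gt;
The headline difference follows directly from the numbers above (Python, with values copied from the table):&lt;br /&gt;
&lt;br /&gt;
 # Submission-to-completion interval from the trace, in us&lt;br /&gt;
 threads_lat = 143.724   # paio_submit to posix_aio_read&lt;br /&gt;
 native_lat = 136.241    # laio_submit to qemu_laio_completion_cb&lt;br /&gt;
 print(round(threads_lat - native_lat, 3))   # 7.483 us in favor of Linux AIO&lt;br /&gt;
 # The end-to-end benchmark difference is of the same magnitude:&lt;br /&gt;
 print(round(200.309 - 193.374, 3))   # 6.935 us&lt;br /&gt;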
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
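&lt;br /&gt;
Each mean in the table is simply the total divided by the count; for example, for select(2):&lt;br /&gt;
&lt;br /&gt;
 count, total_s = 210480, 57.102602&lt;br /&gt;
 print(round(total_s / count, 6))   # 0.000271 s, i.e. a 271 us mean per select() call&lt;br /&gt;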
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the system time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd handling could be modified to reduce latency in the case where a single AIO request has completed.  The read() = -EAGAIN could be avoided by not looping in qemu_laio_completion_cb().  The iothread select(2) call should still detect that more AIO events have completed, since the file descriptor remains readable the next time around the main loop.  This increases latency when further AIO requests complete while execution is still in qemu_laio_completion_cb().&lt;br /&gt;
* The standard deviation of the iothread return from select(2) is high.  There is no complicated code in this path; I think iothread lock contention occasionally causes high latency here.  Most of the time select_post takes only 1 us, not the 8 us suggested by the mean.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3054</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3054"/>
		<updated>2010-07-02T15:17:43Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
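&lt;br /&gt;
Once the trace has been collected, per-request latency is obtained by pairing each vp_notify with the following vring_interrupt.  A Python sketch of this pairing (it assumes the trace lines have already been parsed into tuples):&lt;br /&gt;
&lt;br /&gt;
 def virtio_pci_latencies(events):&lt;br /&gt;
     # events: (timestamp_us, funcname) tuples in trace order&lt;br /&gt;
     lats, notify_ts = [], None&lt;br /&gt;
     for ts, fn in events:&lt;br /&gt;
         if fn == &#039;vp_notify&#039;:&lt;br /&gt;
             notify_ts = ts&lt;br /&gt;
         elif fn == &#039;vring_interrupt&#039; and notify_ts is not None:&lt;br /&gt;
             lats.append(ts - notify_ts)&lt;br /&gt;
             notify_ts = None&lt;br /&gt;
     return lats&lt;br /&gt;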
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
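&lt;br /&gt;
The statistics step reduces to filtering for reads and averaging.  A Python sketch of the idea (the actual &amp;lt;tt&amp;gt;qemu_paio.py&amp;lt;/tt&amp;gt; lives in the tracing branch; the record layout here is an assumption for illustration):&lt;br /&gt;
&lt;br /&gt;
 def paio_read_mean(records):&lt;br /&gt;
     # records: (request_type, latency_ns) tuples; type 0x1 marks reads&lt;br /&gt;
     reads = [lat for rtype, lat in records if rtype == 0x1]&lt;br /&gt;
     return sum(reads) / len(reads)&lt;br /&gt;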
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is a 2-socket, 4-cores-per-socket machine with 8 GB RAM and 12 FC LUNs striped using LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
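&lt;br /&gt;
The subtraction can be reproduced mechanically from the cumulative column:&lt;br /&gt;
&lt;br /&gt;
 cumulative = [196528, 170829, 163268, 159628, 130235]&lt;br /&gt;
 total = cumulative[0]&lt;br /&gt;
 deltas = [hi - lo for hi, lo in zip(cumulative, cumulative[1:])]&lt;br /&gt;
 print(deltas)                                         # [25699, 7561, 3640, 29393]&lt;br /&gt;
 print([round(100.0 * d / total, 2) for d in deltas])  # [13.08, 3.85, 1.85, 14.96]&lt;br /&gt;
 # Host I/O is the final cumulative value: 130235 ns, 66.27% of the total.&lt;br /&gt;
 # The other deltas sum to roughly 33.7%, the virtualization overhead.&lt;br /&gt;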
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range rather than the millisecond seek times normally expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization, since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, block layer, and I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used, because the guest latency figures are consistent with the host figures.  Ideally, guest tracing would use host timestamps so guest and host measurements could be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from the syscall&#039;s return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request was 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, provide a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All times are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports a mean latency of 200.309 us with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
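&lt;br /&gt;
As a rough sketch of that post-processing (not the actual scripts used to produce the tables below, and the trace line format it assumes is simplified), per-thread system call and userspace times can be computed like this:&lt;br /&gt;
&lt;br /&gt;
```python
import re
from collections import defaultdict

# Compute per-thread system call times (sys_enter -> sys_exit) and the
# userspace time after each call (sys_exit -> next sys_enter), i.e. the
# *_post intervals.  Assumed, simplified trace line format:
#   task-pid [cpu] timestamp: sys_enter: NR nr (...)
LINE = re.compile(r'-(?P<pid>\d+)\s+\[\d+\]\s+(?P<ts>\d+\.\d+): '
                  r'sys_(?P<dir>enter|exit): NR (?P<nr>\d+)')

def syscall_times(lines):
    enter = {}                     # pid -> (nr, enter timestamp)
    last_exit = {}                 # pid -> (nr, exit timestamp)
    in_call = defaultdict(list)    # (pid, nr) -> seconds inside the syscall
    post = defaultdict(list)       # (pid, nr) -> seconds in userspace after it
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        pid, ts, nr = m['pid'], float(m['ts']), m['nr']
        if m['dir'] == 'enter':
            if pid in last_exit:
                prev_nr, prev_ts = last_exit.pop(pid)
                post[(pid, prev_nr)].append(ts - prev_ts)
            enter[pid] = (nr, ts)
        else:
            if pid in enter:
                enr, ets = enter.pop(pid)
                in_call[(pid, enr)].append(ts - ets)
            last_exit[pid] = (nr, ts)
    return in_call, post
```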
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
* The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.  It would be interesting to use ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY pio so that the iothread performs the io_submit() instead of the vcpu thread.  This will increase latency but should reduce the system time stolen from the guest.&lt;br /&gt;
* The Linux AIO eventfd handling could be modified to reduce latency in the case where only a single AIO request has completed.  The read() = -EAGAIN could be avoided by not looping in qemu_laio_completion_cb(); the iothread&#039;s select(2) call would then detect that more AIO events have completed, since the file descriptor is still readable the next time around the main loop.  This increases latency when AIO requests complete while qemu_laio_completion_cb() is still running.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3053</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3053"/>
		<updated>2010-07-02T15:10:17Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU, including virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids causing disk I/O just to store the trace.&lt;br /&gt;
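&lt;br /&gt;
To turn the function trace into per-request latencies, each vp_notify timestamp can be paired with the following vring_interrupt.  The sketch below illustrates the idea (it is not one of the scripts from my tracing branch, and the trace line format it assumes is simplified):&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Pair each vp_notify with the following vring_interrupt in the guest's
# function-tracer output and report the virtio-pci latency per request.
# The 1-vcpu guest means requests complete in order, so simple pairing works.
# Assumed line format: "task-pid [cpu] timestamp: function <-caller"
EVENT = re.compile(r'\s(?P<ts>\d+\.\d+): (?P<fn>vp_notify|vring_interrupt)\b')

def virtio_pci_latencies(lines):
    latencies = []
    notify_ts = None
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        if m['fn'] == 'vp_notify':
            notify_ts = float(m['ts'])
        elif notify_ts is not None:       # vring_interrupt closes the pair
            latencies.append(float(m['ts']) - notify_ts)
            notify_ts = None
    return latencies
```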
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
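&lt;br /&gt;
The statistics themselves are just count, mean, standard deviation, minimum, and maximum over the per-request latencies.  A minimal sketch of that calculation (illustrative only, not the actual qemu_paio.py):&lt;br /&gt;
&lt;br /&gt;
```python
from statistics import mean, pstdev

# Summarize per-request latencies the way the result tables below do:
# count, mean, standard deviation, minimum, maximum.
def summarize(latencies):
    return {
        'count': len(latencies),
        'mean': mean(latencies),
        'stdev': pstdev(latencies),
        'min': min(latencies),
        'max': max(latencies),
    }
```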
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
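&lt;br /&gt;
The subtraction can be reproduced directly from the cumulative column; each layer&#039;s percentage is its delta divided by the total guest benchmark latency:&lt;br /&gt;
&lt;br /&gt;
```python
# Cumulative latencies (ns) from the table above, outermost layer first.
cumulative = [
    ('Guest benchmark', 196528),
    ('Guest virtio-pci', 170829),
    ('Host kvm.ko', 163268),
    ('QEMU virtio', 159628),
    ('QEMU paio', 130235),
]

total = cumulative[0][1]
layers = ['Guest', 'Host/guest switching', 'Host/QEMU switching', 'QEMU']
deltas = {}
# Each layer is the difference between successive cumulative readings.
for name, (outer, inner) in zip(layers, zip(cumulative, cumulative[1:])):
    deltas[name] = outer[1] - inner[1]
deltas['Host I/O'] = cumulative[-1][1]   # everything below QEMU paio

for name, ns in deltas.items():
    print('%-22s %6d ns  %5.2f%%' % (name, ns, 100.0 * ns / total))
```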
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
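&lt;br /&gt;
For illustration, the same timing method looks like this in Python (unlike the real C benchmark, this sketch omits O_DIRECT and its buffer-alignment requirements, so it exercises the page cache rather than the disk):&lt;br /&gt;
&lt;br /&gt;
```python
import os
import time

# Mean latency per sequential 4k read(), computed the way the benchmark
# does: one timestamp before the loop, one after, divided by the number
# of reads.
def mean_read_latency_ns(path, block_size=4096):
    fd = os.open(path, os.O_RDONLY)
    try:
        count = 0
        start = time.monotonic_ns()
        while os.read(fd, block_size):    # sequential reads until EOF
            count += 1
        elapsed = time.monotonic_ns() - start
    finally:
        os.close(fd)
    return elapsed / count
```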
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time between virtqueue notify until issuing the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall and return of the syscall until raising an interrupt to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
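&lt;br /&gt;
A futex wait-time aggregation along these lines can be built on top of that trace (this is an illustrative sketch, not the actual futex.py, and the trace line format it assumes is simplified):&lt;br /&gt;
&lt;br /&gt;
```python
import re
from collections import defaultdict

# Sum up, per futex address, how long threads spend blocked in sys_futex
# (sys_enter_futex -> sys_exit_futex per thread).  Assumed line formats:
#   task-pid [cpu] timestamp: sys_enter_futex: uaddr: 0x... op: ...
#   task-pid [cpu] timestamp: sys_exit_futex: ret: ...
ENTER = re.compile(r'-(?P<pid>\d+)\s+\S+\s+(?P<ts>\d+\.\d+): '
                   r'sys_enter_futex: uaddr: (?P<uaddr>0x[0-9a-f]+)')
EXIT = re.compile(r'-(?P<pid>\d+)\s+\S+\s+(?P<ts>\d+\.\d+): sys_exit_futex:')

def futex_wait_times(lines):
    pending = {}                     # pid -> (uaddr, enter timestamp)
    waits = defaultdict(float)       # uaddr -> total seconds blocked
    for line in lines:
        m = ENTER.search(line)
        if m:
            pending[m['pid']] = (m['uaddr'], float(m['ts']))
            continue
        m = EXIT.search(line)
        if m and m['pid'] in pending:
            uaddr, ts0 = pending.pop(m['pid'])
            waits[uaddr] += float(m['ts']) - ts0
    return waits
```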
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static trace points to QEMU, provides a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;br /&gt;
&lt;br /&gt;
Observations on this data:&lt;br /&gt;
 * The VIRTIO_PCI_QUEUE_NOTIFY pio has a latency of over 22 us!  This is largely due to io_submit() taking 16 us.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3052</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3052"/>
		<updated>2010-07-02T15:02:13Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU, including virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids causing disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
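The notify-to-interrupt pairing can be sketched in Python.  This is a hypothetical sketch, not part of the original tooling: the tuple format below is an assumption about the pretty-printed simpletrace output and will need adapting to the real field layout.&lt;br /&gt;

```python
# Hypothetical sketch: pair virtio_queue_notify/virtio_notify timestamps
# for one vdev and report the mean QEMU virtio latency.  The
# (event_name, vdev, timestamp_ns) tuple format is an assumption; adapt
# it to the actual simpletrace.py output.

def virtio_latency_ns(events, vdev):
    """events: iterable of (name, vdev, timestamp_ns) tuples."""
    latencies = []
    notify_ts = None
    for name, dev, ts in events:
        if dev != vdev:
            continue  # ignore other virtio devices in the trace
        if name == "virtio_queue_notify":
            notify_ts = ts  # guest kicked the virtqueue
        elif name == "virtio_notify" and notify_ts is not None:
            latencies.append(ts - notify_ts)  # interrupt raised
            notify_ts = None
    return sum(latencies) / len(latencies) if latencies else None
```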
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
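As a quick sanity check (a sketch, not from the original page), the Delta column can be recomputed from the Cumulative latency column:&lt;br /&gt;

```python
# Sanity check: recompute the Delta table from the Cumulative latency
# column above (all values in nanoseconds, outermost layer first).
cumulative = [196528, 170829, 163268, 159628, 130235]
names = ["Guest", "Host/guest switching", "Host/QEMU switching", "QEMU"]
total = cumulative[0]

# Each delta is the difference between adjacent cumulative readings;
# the innermost reading (QEMU paio) is attributed entirely to Host I/O.
deltas = [(n, a - b) for n, a, b in zip(names, cumulative, cumulative[1:])]
deltas.append(("Host I/O", cumulative[-1]))

for name, d in deltas:
    print("%-22s %6d ns  %5.2f%%" % (name, d, 100.0 * d / total))
```

The printed deltas and percentages match the table above, and the deltas sum back to the 196528 ns guest benchmark total.&lt;br /&gt;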
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; delta of 25699 ns (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from the syscall&#039;s return until an interrupt is raised to notify the guest.  QEMU builds AIO requests for each virtio-blk read command and transforms the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
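As a sketch (not from the original measurements), summing the traced stages shows how much of the benchmark-reported latency the trace accounts for:&lt;br /&gt;

```python
# Sum the per-stage latencies from the table above (microseconds) to
# compare the two AIO mechanisms end to end.
stages_threads = [45.292, 8.023, 143.724, 1.965, 0.260, 1.034]
stages_native  = [44.464, 8.377, 136.241, 1.754, 0.294, 1.342]

total_threads = sum(stages_threads)  # aio=threads
total_native = sum(stages_native)    # aio=native

# The benchmark itself reported 200.309 us (threads) and 193.374 us
# (native); the small gap is time spent outside the traced stages.
print("aio=threads %.3f us, aio=native %.3f us" % (total_threads, total_native))
```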
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
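A minimal sketch of this pairing, assuming the raw_syscalls events have already been parsed out of the ftrace output into tuples (the tuple format is an assumption, not the real trace format):&lt;br /&gt;

```python
from collections import defaultdict

# Sketch: pair raw_syscalls sys_enter/sys_exit events per thread to get
# system call durations and the userspace gaps between them, analogous
# to the *_post statistics below.  Events are assumed to be pre-parsed
# into (tid, timestamp, kind, syscall_nr) tuples.

def syscall_times(events):
    """Return ({(tid, nr): [durations]}, {tid: [userspace gaps]})."""
    pending = {}                  # tid -> (nr, enter timestamp)
    last_exit = {}                # tid -> exit timestamp of last syscall
    syscalls = defaultdict(list)  # time blocked in each syscall
    userspace = defaultdict(list) # time in userspace between syscalls
    for tid, ts, kind, nr in events:
        if kind == "enter":
            if tid in last_exit:
                userspace[tid].append(ts - last_exit[tid])
            pending[tid] = (nr, ts)
        elif kind == "exit" and tid in pending:
            pnr, enter_ts = pending.pop(tid)
            if pnr == nr:
                syscalls[(tid, nr)].append(ts - enter_ts)
            last_exit[tid] = ts
    return syscalls, userspace
```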
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select()&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read()&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_IRQ_LINE)&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3051</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3051"/>
		<updated>2010-07-02T14:59:31Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that writing the trace file to a tmpfs filesystem avoids generating extra disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
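The notify-to-interrupt pairing can be sketched in Python.  This is a hypothetical sketch, not part of the original tooling: the tuple format below is an assumption about the pretty-printed simpletrace output and will need adapting to the real field layout.&lt;br /&gt;

```python
# Hypothetical sketch: pair virtio_queue_notify/virtio_notify timestamps
# for one vdev and report the mean QEMU virtio latency.  The
# (event_name, vdev, timestamp_ns) tuple format is an assumption; adapt
# it to the actual simpletrace.py output.

def virtio_latency_ns(events, vdev):
    """events: iterable of (name, vdev, timestamp_ns) tuples."""
    latencies = []
    notify_ts = None
    for name, dev, ts in events:
        if dev != vdev:
            continue  # ignore other virtio devices in the trace
        if name == "virtio_queue_notify":
            notify_ts = ts  # guest kicked the virtqueue
        elif name == "virtio_notify" and notify_ts is not None:
            latencies.append(ts - notify_ts)  # interrupt raised
            notify_ts = None
    return sum(latencies) / len(latencies) if latencies else None
```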
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
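As a quick sanity check (a sketch, not from the original page), the Delta column can be recomputed from the Cumulative latency column:&lt;br /&gt;

```python
# Sanity check: recompute the Delta table from the Cumulative latency
# column above (all values in nanoseconds, outermost layer first).
cumulative = [196528, 170829, 163268, 159628, 130235]
names = ["Guest", "Host/guest switching", "Host/QEMU switching", "QEMU"]
total = cumulative[0]

# Each delta is the difference between adjacent cumulative readings;
# the innermost reading (QEMU paio) is attributed entirely to Host I/O.
deltas = [(n, a - b) for n, a, b in zip(names, cumulative, cumulative[1:])]
deltas.append(("Host I/O", cumulative[-1]))

for name, d in deltas:
    print("%-22s %6d ns  %5.2f%%" % (name, d, 100.0 * d / total))
```

The printed deltas and percentages match the table above, and the deltas sum back to the 196528 ns guest benchmark total.&lt;br /&gt;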
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; delta of 25699 ns (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block, and io scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from the syscall&#039;s return until an interrupt is raised to notify the guest.  QEMU builds AIO requests for each virtio-blk read command and transforms the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to examine the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request was 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
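&lt;br /&gt;
As a quick arithmetic check, the 23% figure can be reproduced from the raw numbers elsewhere on this page (the 29393 ns QEMU delta and the 128862 ns host benchmark latency).  A minimal sketch:&lt;br /&gt;

```python
# Recompute the QEMU overhead percentage from the raw latency figures.
qemu_delta_ns = 29393   # time spent inside QEMU per read request
host_read_ns = 128862   # host benchmark per-request latency

overhead_pct = 100.0 * qemu_delta_ns / host_read_ns
print(f"QEMU adds {qemu_delta_ns / 1000:.0f} us, {overhead_pct:.0f}% overhead")
```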
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports 200.309 us latency with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace supporting this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
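&lt;br /&gt;
The per-stage figures above can be cross-checked against the totals reported by the seqread benchmark.  A small sketch that sums the traced stages for each AIO mechanism (numbers copied from the table; any residual versus the reported totals is time not covered by these trace points):&lt;br /&gt;

```python
# Sum the traced per-stage latencies (us) for each AIO mechanism.
stages_threads = [45.292, 8.023, 143.724, 1.965, 0.260, 1.034]
stages_native = [44.464, 8.377, 136.241, 1.754, 0.294, 1.342]

total_threads = round(sum(stages_threads), 3)   # benchmark reported 200.309 us
total_native = round(sum(stages_native), 3)     # benchmark reported 193.374 us
gap = round(143.724 - 136.241, 3)               # submit-to-completion difference

print(total_threads, total_native, gap)
```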
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU cannot easily show the latency breakdown between userspace and system calls.  Because trace events execute inside QEMU while the iothread mutex is held, they cannot capture the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
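&lt;br /&gt;
As an illustration of how per-syscall timings can be extracted from such a trace, here is a hypothetical Python sketch.  The line format below is an assumption based on typical ftrace raw_syscalls output; the exact field layout varies between kernel versions:&lt;br /&gt;

```python
# Pair sys_enter/sys_exit events per (tid, syscall nr) and compute durations.
import re
from collections import defaultdict

# Assumed line shape: "comm-tid [cpu] timestamp: sys_enter: NR 23 (args)"
PATTERN = re.compile(r"\S+-(\d+)\s+\[\d+\]\s+(\d+\.\d+): sys_(enter|exit): NR (\d+)")

def syscall_durations(lines):
    pending = {}                    # (tid, nr) of the in-flight syscall
    durations = defaultdict(list)   # (tid, nr) to list of durations in seconds
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue
        tid, ts, kind, nr = m.group(1), float(m.group(2)), m.group(3), m.group(4)
        key = (tid, nr)
        if kind == "enter":
            pending[key] = ts
        elif key in pending:
            durations[key].append(ts - pending.pop(key))
    return durations

# Hypothetical trace excerpt with one select(2)-like syscall taking 271 us.
trace = [
    "qemu-1234  [000]  100.000100: sys_enter: NR 23 (5, 0, 0)",
    "qemu-1234  [000]  100.000371: sys_exit: NR 23 = 1",
]
d = syscall_durations(trace)
print(round(d[("1234", "23")][0], 6))
```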
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The iothread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|select&lt;br /&gt;
|210480&lt;br /&gt;
|0.000271&lt;br /&gt;
|0.001690&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.030008&lt;br /&gt;
|57.102602&lt;br /&gt;
|-&lt;br /&gt;
|select_post&lt;br /&gt;
|209097&lt;br /&gt;
|0.000009&lt;br /&gt;
|0.000470&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.030010&lt;br /&gt;
|1.879496&lt;br /&gt;
|-&lt;br /&gt;
|read&lt;br /&gt;
|418439&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000021&lt;br /&gt;
|0.325694&lt;br /&gt;
|-&lt;br /&gt;
|read_post&lt;br /&gt;
|310035&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000052&lt;br /&gt;
|0.459388&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents&lt;br /&gt;
|204800&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000008&lt;br /&gt;
|0.161967&lt;br /&gt;
|-&lt;br /&gt;
|io_getevents_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000074&lt;br /&gt;
|0.388233&lt;br /&gt;
|-&lt;br /&gt;
|ioctl&lt;br /&gt;
|204829&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000025&lt;br /&gt;
|0.807423&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|204828&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000000&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000013&lt;br /&gt;
|0.257511&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vcpu thread latency statistics are as follows:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3050</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3050"/>
		<updated>2010-07-02T14:55:37Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
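&lt;br /&gt;
For illustration, here is a rough Python equivalent of that benchmark loop.  This sketch uses an ordinary temporary file and omits &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; (which requires aligned buffers), so it measures page-cache reads rather than true disk latency; the real benchmark is a C program:&lt;br /&gt;

```python
# Sequential 4k read benchmark sketch: report mean per-request latency.
import os
import tempfile
import time

BLOCK = 4096
COUNT = 256

# Create a test file (the real benchmark targets a disk with O_DIRECT).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * BLOCK * COUNT)
    path = f.name

fd = os.open(path, os.O_RDONLY)
start = time.monotonic_ns()
for i in range(COUNT):
    data = os.pread(fd, BLOCK, i * BLOCK)   # sequential 4k reads
end = time.monotonic_ns()
os.close(fd)
os.unlink(path)

mean_ns = (end - start) // COUNT
print(f"mean latency per 4k read: {mean_ns} ns")
```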
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace itself.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
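&lt;br /&gt;
The statistics step can be sketched as follows.  This is a hypothetical stand-in for qemu_paio.py (which is not shown here) that computes the same kind of summary from submit/completion timestamp pairs in microseconds:&lt;br /&gt;

```python
# Summarize request latencies from paired submit/completion timestamps.
import statistics

def latency_stats(pairs):
    """pairs: list of (submit_us, complete_us) tuples for read requests."""
    samples = [done - submitted for submitted, done in pairs]
    return {
        "count": len(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples) if len(samples) != 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }

# Hypothetical timestamps for three requests.
pairs = [(100.0, 243.7), (500.0, 636.2), (900.0, 1041.9)]
stats = latency_stats(pairs)
print(stats["count"], round(stats["mean"], 3))
```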
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
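&lt;br /&gt;
The subtraction can be reproduced mechanically.  Here is a short Python sketch deriving the delta column from the cumulative readings above (each delta is labelled with the outer layer of the pair being subtracted):&lt;br /&gt;

```python
# Derive per-layer deltas from cumulative latency readings (ns).
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
]
total = cumulative[0][1]

# Each layer's cost is the difference between successive readings;
# the innermost reading (QEMU paio) is the host I/O time itself.
layers = []
for (name, outer), (_, inner) in zip(cumulative, cumulative[1:]):
    layers.append((name, outer - inner))
layers.append(("Host I/O", cumulative[-1][1]))

for name, delta in layers:
    print(f"{name}: {delta} ns ({100.0 * delta / total:.2f}%)")
```

The deltas sum back to the guest benchmark total, matching the table.&lt;br /&gt;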
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 202095 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much other interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection, which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used here, because the guest latency figures are consistent with the host latency figures.  Ideally, guest tracing would use host timestamps so that guest and host events could be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer covers two spans: from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, and from syscall return until the interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads block for 20 us on average when the iothread mutex is contended.  Further work could investigate whether that is happening here, and how to restructure QEMU to eliminate the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to examine the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request was 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports 200.309 us latency with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace supporting this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU cannot easily show the latency breakdown between userspace and system calls.  Because trace events execute inside QEMU while the iothread mutex is held, they cannot capture the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the userspace/system call times for the iothread and vcpu threads:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:threads.png]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The vcpu thread has the following behavior:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=File:Threads.png&amp;diff=3049</id>
		<title>File:Threads.png</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=File:Threads.png&amp;diff=3049"/>
		<updated>2010-07-02T14:53:44Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: Virtio-blk iothread and vcpu thread latency diagram&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Virtio-blk iothread and vcpu thread latency diagram&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3048</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3048"/>
		<updated>2010-07-02T14:43:04Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
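The benchmark source is not included on this page.  As a rough, hypothetical stand-in (not the actual seqread program), a similar measurement can be made from the shell with dd:&lt;br /&gt;

```shell
# Hypothetical stand-in for the seqread C benchmark: time 1024 sequential 4k
# reads with dd and report the mean per-request latency.  Not the actual code.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=4096 count=1024 2>/dev/null
t0=$(date +%s%N)
# O_DIRECT bypasses the page cache; fall back to buffered reads if the
# filesystem (e.g. tmpfs) does not support it.
dd if="$f" of=/dev/null bs=4096 iflag=direct 2>/dev/null ||
    dd if="$f" of=/dev/null bs=4096 2>/dev/null
t1=$(date +%s%N)
result="mean: $(( (t1 - t0) / 1024 )) ns per 4k read"
echo "$result"
rm -f "$f"
```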
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events can instrument components inside QEMU itself.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from this git branch:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
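As a concrete illustration (the lspci BAR line below is a made-up sample, not output from this host), the pio base for the filter can be pulled out of the BAR line:&lt;br /&gt;

```shell
# Extract the I/O port base from a sample lspci BAR line.  The sample line is
# fabricated; run lspci -vv inside the guest to get the real line for the
# virtio-blk device.
port=$(printf '%s\n' 'Region 0: I/O ports at c090 [size=64]' |
    awk '$5 == "at" { print "0x" $6 }')
echo "$port"   # filter value for events/kvm/kvm_pio/filter
```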
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
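The statistics script itself is not shown on this page; the core idea, pairing per-request submit and completion timestamps, can be sketched like this (event names and numbers below are fabricated for the example, not real trace data):&lt;br /&gt;

```shell
# Illustrative sketch of what a script like qemu_paio.py computes: pair each
# request id with its submit/completion timestamps (ns) and average the
# differences.  The trace lines are made up for the example.
lat=$(printf '%s\n' \
    'submit 1 1000' \
    'complete 1 131000' \
    'submit 2 200000' \
    'complete 2 332000' |
awk '$1 == "submit"   { start[$2] = $3 }
     $1 == "complete" { sum += $3 - start[$2]; n++ }
     END { printf "mean latency: %d ns over %d requests", sum / n, n }')
echo "$lat"
```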
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares benchmark results on the host with results from inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how much time is being spent in each layer of the virtualization stack.&lt;br /&gt;
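The subtraction is mechanical; with the cumulative figures from the table above it can be reproduced in one pipeline (layer names are dash-joined here so awk sees two fields per line):&lt;br /&gt;

```shell
# Derive the per-layer deltas from the cumulative latencies (ns) above.  The
# innermost cumulative value (QEMU paio) is the remaining host I/O time.
deltas=$(printf '%s\n' \
    'Guest-benchmark 196528' \
    'Guest-virtio-pci 170829' \
    'Host-kvm.ko 163268' \
    'QEMU-virtio 159628' \
    'QEMU-paio 130235' |
awk 'NR == 1 { total = $2 }
     prev    { printf "%s minus %s: %d ns (%.2f%%)\n", name, $1, prev - $2, (prev - $2) * 100 / total }
             { prev = $2; name = $1 }
     END     { printf "remainder inside %s: %d ns (%.2f%%)\n", name, prev, prev * 100 / total }')
echo "$deltas"
```

The printed deltas and percentages match the table above.&lt;br /&gt;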
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there is not much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block layer, and I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, and from the syscall&#039;s return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static trace points to QEMU, is a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports 200.309 us latency with aio=threads and 193.374 us with aio=native.  The Linux AIO mechanism therefore has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Userspace and System Call times ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU have a hard time showing the latency breakdown between userspace and system calls.  Because trace events are inside QEMU and the iothread mutex must be held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
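A minimal sketch of such a session (assuming root and a mounted debugfs; the thread id is a placeholder for the real vcpu thread id):&lt;br /&gt;

```shell
# Sketch: gather syscall entry/exit events for one thread with the ftrace
# raw_syscalls trace events, similar to strace but with lower overhead.
# VCPU_TID is a placeholder; find the real vcpu thread id with e.g. ps -L.
VCPU_TID=${VCPU_TID:-12345}
T=/sys/kernel/debug/tracing
if [ -w "$T/trace" ]; then
    echo "common_pid == $VCPU_TID" > "$T/events/raw_syscalls/sys_enter/filter"
    echo "common_pid == $VCPU_TID" > "$T/events/raw_syscalls/sys_exit/filter"
    echo 1 > "$T/events/raw_syscalls/sys_enter/enable"
    echo 1 > "$T/events/raw_syscalls/sys_exit/enable"
    timeout 5 cat "$T/trace_pipe" > /tmp/trace   # collect while benchmark runs
    echo 0 > "$T/events/raw_syscalls/sys_enter/enable"
    echo 0 > "$T/events/raw_syscalls/sys_exit/enable"
else
    echo "tracing unavailable (need root and debugfs)"
fi
done_marker='raw_syscalls sketch done'
echo "$done_marker"
```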
&lt;br /&gt;
The vcpu thread has the following behavior:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl(KVM_RUN)&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit()&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The *_post statistics show the time spent inside QEMU userspace after a system call.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3047</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3047"/>
		<updated>2010-07-02T14:35:31Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events can instrument components inside QEMU itself.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from this git branch:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares benchmark results on the host with results from inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how much time is being spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there is not much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the vfs, block layer, and I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
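&lt;br /&gt;
The futex.py script is not reproduced here, but the idea can be sketched in Python: pair each sys_enter_futex with the matching sys_exit_futex on the same thread and report how long the thread was blocked.  The line format below is a simplified assumption, not the exact ftrace output:&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Simplified patterns; real ftrace lines carry additional fields.
ENTER = re.compile(r"(\d+)\s+\[\d+\]\s+(\d+\.\d+): sys_enter_futex: uaddr: (0x[0-9a-f]+)")
EXIT = re.compile(r"(\d+)\s+\[\d+\]\s+(\d+\.\d+): sys_exit_futex")

def futex_wait_times(lines):
    # Returns a list of (tid, uaddr, seconds_blocked) tuples.
    pending = {}
    waits = []
    for line in lines:
        m = ENTER.search(line)
        if m:
            pending[m.group(1)] = (float(m.group(2)), m.group(3))
            continue
        m = EXIT.search(line)
        if m and m.group(1) in pending:
            t0, uaddr = pending.pop(m.group(1))
            waits.append((m.group(1), uaddr, float(m.group(2)) - t0))
    return waits
```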
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static trace points to QEMU, provides a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The time between request submission and completion is lower with Linux AIO.&#039;&#039;&#039;  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;br /&gt;
&lt;br /&gt;
Note that the 8 us latency from virtio_queue_notify to submit is because the QEMU binary used to gather these results does not have the virtio-blk memset elimination patch.&lt;br /&gt;
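&lt;br /&gt;
As a rough cross-check, the per-stage intervals in the table can be summed and compared with the totals reported by the seqread benchmark; the small remainder is time spent outside the traced stages.  A sketch using the numbers above:&lt;br /&gt;
&lt;br /&gt;
```python
# Per-stage intervals (us) from the table above.
threads = [45.292, 8.023, 143.724, 1.965, 0.260, 1.034]  # aio=threads
native = [44.464, 8.377, 136.241, 1.754, 0.294, 1.342]   # aio=native

sum_threads = sum(threads)  # close to the reported 200.309 us
sum_native = sum(native)    # close to the reported 193.374 us

# The submission-to-completion stage dominates and is where Linux AIO wins.
aio_gain = 143.724 - 136.241
```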
&lt;br /&gt;
=== Vcpu/iothread interaction ===&lt;br /&gt;
&lt;br /&gt;
Trace events inside QEMU cannot easily show the latency breakdown between userspace and system calls.  Because trace events run inside QEMU while the iothread mutex is held, it is not possible to measure the exact boundaries of blocking system calls like select(2) and ioctl(KVM_RUN).&lt;br /&gt;
&lt;br /&gt;
The ftrace raw_syscalls events can be used like strace to gather system call entry/exit times for threads.&lt;br /&gt;
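&lt;br /&gt;
The statistics below were computed by pairing syscall entry and exit events per thread.  A minimal sketch of that pairing, assuming the trace has already been parsed into (tid, kind, nr, timestamp) tuples:&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import defaultdict

def syscall_durations(events):
    # events: iterable of (tid, kind, nr, ts) where kind is "enter" or
    # "exit", nr is the syscall number and ts is a timestamp in seconds.
    pending = {}
    durations = defaultdict(list)
    for tid, kind, nr, ts in events:
        if kind == "enter":
            pending[(tid, nr)] = ts
        elif (tid, nr) in pending:
            durations[nr].append(ts - pending.pop((tid, nr)))
    return durations

def stats(samples):
    # Count, mean, standard deviation, minimum, maximum, total: the
    # same columns as the table of results.
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return n, mean, math.sqrt(var), min(samples), max(samples), sum(samples)
```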
&lt;br /&gt;
The following data were gathered with this approach:&lt;br /&gt;
{|&lt;br /&gt;
!Event&lt;br /&gt;
!Count&lt;br /&gt;
!Mean (s)&lt;br /&gt;
!Std deviation (s)&lt;br /&gt;
!Minimum (s)&lt;br /&gt;
!Maximum (s)&lt;br /&gt;
!Total (s)&lt;br /&gt;
|-&lt;br /&gt;
|ioctl&lt;br /&gt;
|224793&lt;br /&gt;
|0.000224&lt;br /&gt;
|0.011423&lt;br /&gt;
|0.000000&lt;br /&gt;
|1.991701&lt;br /&gt;
|50.438935&lt;br /&gt;
|-&lt;br /&gt;
|ioctl_post&lt;br /&gt;
|224785&lt;br /&gt;
|0.000004&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000054&lt;br /&gt;
|0.994368&lt;br /&gt;
|-&lt;br /&gt;
|io_submit&lt;br /&gt;
|204800&lt;br /&gt;
|0.000016&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000015&lt;br /&gt;
|0.000111&lt;br /&gt;
|3.303320&lt;br /&gt;
|-&lt;br /&gt;
|io_submit_post&lt;br /&gt;
|204800&lt;br /&gt;
|0.000002&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000001&lt;br /&gt;
|0.000039&lt;br /&gt;
|0.331057&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3046</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3046"/>
		<updated>2010-07-02T14:23:59Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
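&lt;br /&gt;
Once the trace has been captured, the kvm latency is the interval from each &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; event to the following &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; event.  A minimal sketch of that pairing, assuming the trace has been parsed into (timestamp, event name) tuples:&lt;br /&gt;
&lt;br /&gt;
```python
def kvm_latencies(events):
    # events: iterable of (ts_seconds, name) with name "kvm_pio" or
    # "kvm_set_irq", already filtered to the virtio-blk device as above.
    t_pio = None
    out = []
    for ts, name in events:
        if name == "kvm_pio":
            t_pio = ts  # virtqueue notify pio exit
        elif name == "kvm_set_irq" and t_pio is not None:
            out.append(ts - t_pio)  # interrupt set inside the guest
            t_pio = None
    return out
```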
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
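&lt;br /&gt;
The subtraction can be reproduced directly from the cumulative table.  A short Python sketch:&lt;br /&gt;
&lt;br /&gt;
```python
# Cumulative latencies (ns) from the table above, outermost layer first.
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
    ("Host benchmark", 128862),
]

layers = ["Guest", "Host/guest switching", "Host/QEMU switching", "QEMU"]
total = cumulative[0][1]
deltas = [(name, cumulative[i][1] - cumulative[i + 1][1])
          for i, name in enumerate(layers)]
# The innermost layer is the host I/O time itself (the QEMU paio reading).
deltas.append(("Host I/O", cumulative[4][1]))
for name, ns in deltas:
    print(name, ns, round(100 * ns / total, 2))
```
The printed deltas and percentages match the delta table above.&lt;br /&gt;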
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection, which are accounted for under &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static trace points to QEMU, provides a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare latency of the two AIO mechanisms.  All time measurements in microseconds.&lt;br /&gt;
&lt;br /&gt;
The seqread benchmark reports aio=threads 200.309 us and aio=native 193.374 us latency.  The Linux AIO mechanism has lower latency than POSIX AIO emulation; here is the detailed latency trace to support this observation:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads (us)&lt;br /&gt;
!aio=native (us)&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|&#039;&#039;&#039;143.724&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The time between request submission and completion is lower with Linux AIO.  paio_submit -&amp;gt; posix_aio_read takes 143.724 us while laio_submit -&amp;gt; qemu_laio_completion_cb takes only 136.241 us.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3045</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3045"/>
		<updated>2010-07-02T14:19:38Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Events tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host with the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived by subtracting successive cumulative latency readings (e.g. Guest = 196528 - 170829 = 25699 ns):&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until issuing the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall, plus the time from return of the syscall until raising an interrupt to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static tracepoints to QEMU, provides a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All times are in microseconds.&lt;br /&gt;
&lt;br /&gt;
aio=threads seqread 200.309 us&lt;br /&gt;
aio=native  seqread 193.374 us&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads&lt;br /&gt;
!aio=native&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|143.724&lt;br /&gt;
|&#039;&#039;&#039;136.241&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3044</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3044"/>
		<updated>2010-07-02T14:17:18Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
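A minimal sketch of such a benchmark, written here in Python rather than C for brevity (the function name and parameters are illustrative, not taken from the original program):&lt;br /&gt;

```python
import mmap
import os
import time

BLOCK = 4096  # 4k requests, matching the benchmark described above

def seqread_mean_latency(path, count, use_direct=True):
    """Sequentially read count 4k blocks from path; return mean ns per read."""
    flags = os.O_RDONLY | (os.O_DIRECT if use_direct else 0)
    fd = os.open(path, flags)
    # O_DIRECT requires an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, BLOCK)
    try:
        start = time.monotonic()
        for _ in range(count):
            os.readv(fd, [buf])  # readv accepts the writable mmap buffer
        return (time.monotonic() - start) / count * 1e9
    finally:
        buf.close()
        os.close(fd)
```

Timestamps are taken once around the whole loop, so the result is the mean per-request latency, as in the original program.  With &amp;lt;tt&amp;gt;use_direct=True&amp;lt;/tt&amp;gt; the page cache is bypassed, matching the &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; behaviour described above.&lt;br /&gt;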
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that writing the trace file to a tmpfs filesystem avoids generating extra disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
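The statistics boil down to pairing submission and completion timestamps for each request.  A minimal sketch of the calculation (the timestamp-pair layout is an assumption, not the actual qemu_paio.py input format):&lt;br /&gt;

```python
def latency_stats(pairs):
    """pairs: (submit_ns, complete_ns) timestamp tuples for read requests."""
    latencies = sorted(complete - submit for submit, complete in pairs)
    mean = sum(latencies) / len(latencies)
    return mean, latencies[0], latencies[-1]  # mean, min, max in ns
```

For example, &amp;lt;tt&amp;gt;latency_stats([(0, 10), (5, 25), (7, 37)])&amp;lt;/tt&amp;gt; returns a mean of 20.0 ns with a 10-30 ns range.&lt;br /&gt;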
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host with the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived by subtracting successive cumulative latency readings (e.g. Guest = 196528 - 170829 = 25699 ns):&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
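The subtraction can be checked mechanically.  The following sketch (not part of the original analysis) recomputes the per-layer deltas from the cumulative figures above:&lt;br /&gt;

```python
# Cumulative mean latencies (ns) from the table above, outermost layer first.
points = [196528, 170829, 163268, 159628, 130235]
layers = ["Guest", "Host/guest switching", "Host/QEMU switching", "QEMU"]

total = points[0]
for name, outer, inner in zip(layers, points, points[1:]):
    delta = outer - inner
    print(f"{name}: {delta} ns ({delta / total:.2%})")
# The innermost cumulative figure (QEMU paio) is itself the Host I/O time.
print(f"Host I/O: {points[-1]} ns ({points[-1] / total:.2%})")
```

Running it reproduces the &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; and &#039;&#039;&#039;Delta (%)&#039;&#039;&#039; columns of the table above.&lt;br /&gt;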
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until issuing the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall, plus the time from return of the syscall until raising an interrupt to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events work, which adds static tracepoints to QEMU, provides a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
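These trace events can be post-processed into per-request latencies.  As a sketch, pairing each virtio_queue_notify with the following virtio_notify for one device gives the QEMU-internal service time.  The (timestamp_ns, event, vdev) record format and the vdev value below are hypothetical, not the actual simpletrace output:&lt;br /&gt;

```python
# Sketch: compute per-request QEMU service times by pairing each guest-to-host
# notify (virtio_queue_notify) with the following host-to-guest notify
# (virtio_notify) for one device. The (timestamp_ns, event, vdev) record
# format and the vdev value are hypothetical, not actual simpletrace output.
records = [
    (1000, "virtio_queue_notify", "0x2517500"),
    (160628, "virtio_notify", "0x2517500"),
]

def qemu_virtio_latencies(records, vdev):
    """Return notify-to-notify latencies (ns) for the given virtio device."""
    latencies = []
    start = None
    for ts, event, dev in records:
        if dev != vdev:
            continue  # skip requests for other virtio devices
        if event == "virtio_queue_notify":
            start = ts
        elif event == "virtio_notify" and start is not None:
            latencies.append(ts - start)
            start = None
    return latencies

print(qemu_virtio_latencies(records, "0x2517500"))  # [159628]
```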
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
 aio=threads seqread 200.309 us&lt;br /&gt;
 aio=native  seqread 193.374 us&lt;br /&gt;
&lt;br /&gt;
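The per-event figures in the table below are mean intervals between successive trace events.  Summing the intervals after virtio_queue_notify cross-checks the per-request service time of each AIO mechanism; a quick sanity check in Python, with the numbers copied from the table:&lt;br /&gt;

```python
# Mean per-request intervals (microseconds) between successive trace events,
# copied from the measurements in this section. The intervals after
# virtio_queue_notify sum to the QEMU-side service time per request.
threads = [8.023, 143.724, 1.965, 0.260, 1.034]  # paio_submit .. virtio_notify
native = [8.377, 136.241, 1.754, 0.294, 1.342]   # laio_submit .. virtio_notify

print(round(sum(threads), 3))  # 155.006 us with aio=threads
print(round(sum(native), 3))   # 148.008 us with aio=native
# aio=native saves roughly 7 us per request, matching the seqread difference
# (200.309 - 193.374 = 6.935 us) within measurement noise.
```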
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!aio=threads&lt;br /&gt;
!aio=native&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|45.292&lt;br /&gt;
|44.464&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit/laio_submit&lt;br /&gt;
|8.023&lt;br /&gt;
|8.377&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read/qemu_laio_completion_cb&lt;br /&gt;
|143.724&lt;br /&gt;
|136.241&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue/qemu_laio_enqueue_completed&lt;br /&gt;
|1.965&lt;br /&gt;
|1.754&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|0.260&lt;br /&gt;
|0.294&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|1.034&lt;br /&gt;
|1.342&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3043</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3043"/>
		<updated>2010-07-02T14:12:28Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
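A minimal Python sketch of the same measurement idea follows.  The real benchmark is a C program; O_DIRECT and the buffer alignment it requires are omitted here for portability, so this sketch goes through the page cache:&lt;br /&gt;

```python
import os
import tempfile
import time

# Sketch of the sequential-read latency benchmark. The real benchmark is a C
# program reading through an O_DIRECT file descriptor; O_DIRECT (and the
# buffer alignment it requires) is omitted here, so reads may hit the page
# cache.
BLOCK = 4096
NBLOCKS = 256

# Create a test file of NBLOCKS zero-filled 4k blocks.
tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"\x00" * (BLOCK * NBLOCKS))
os.close(tmp_fd)

fd = os.open(path, os.O_RDONLY)
start = time.monotonic_ns()
reads = 0
while True:
    buf = os.read(fd, BLOCK)
    if not buf:
        break
    reads += 1
elapsed = time.monotonic_ns() - start
os.close(fd)
os.unlink(path)

# Mean time per read() in nanoseconds: total loop time / number of reads.
print(reads, elapsed // reads)
```

The mean is computed as described later in this page: one timestamp before the loop, one after all requests have finished, and the total divided by the number of read() calls.&lt;br /&gt;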
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating extra disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
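The subtraction can be sketched as follows.  The cumulative numbers come from the table above; note that the innermost row (Host I/O) is the QEMU paio figure itself rather than a difference:&lt;br /&gt;

```python
# Cumulative latencies (ns) from the previous table, outermost layer first.
# The host benchmark (128862 ns) is a separate reference measurement and is
# not part of the subtraction.
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
]
total = cumulative[0][1]

# Each layer's share is the difference between successive readings; the
# innermost layer (Host I/O) is the QEMU paio time itself.
deltas = [a[1] - b[1] for a, b in zip(cumulative, cumulative[1:])]
deltas.append(cumulative[-1][1])

for d in deltas:
    print(d, round(100.0 * d / total, 2))
```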
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request made up 29 us or a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patch series, which adds static trace points to QEMU, is a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== posix-aio-compat versus linux-aio ===&lt;br /&gt;
&lt;br /&gt;
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads and native Linux AIO.&lt;br /&gt;
&lt;br /&gt;
The following results compare the latency of the two AIO mechanisms.  All time measurements are in microseconds.&lt;br /&gt;
&lt;br /&gt;
  * aio=threads&lt;br /&gt;
    * seqread 200.309 us&lt;br /&gt;
    * virtio_queue_notify 45.292&lt;br /&gt;
    * paio_submit 8.023&lt;br /&gt;
    * posix_aio_read 143.724&lt;br /&gt;
    * posix_aio_process_queue 1.965&lt;br /&gt;
    * virtio_blk_rw_complete 0.260&lt;br /&gt;
    * virtio_notify 1.034&lt;br /&gt;
    * total 155.006&lt;br /&gt;
  * aio=native&lt;br /&gt;
    * seqread 193.374 us&lt;br /&gt;
    * virtio_queue_notify 44.464&lt;br /&gt;
    * laio_submit 8.377&lt;br /&gt;
    * qemu_laio_completion_cb 136.241&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.754&lt;br /&gt;
    * virtio_blk_rw_complete 0.294&lt;br /&gt;
    * virtio_notify 1.342&lt;br /&gt;
    * total 148.008&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3042</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3042"/>
		<updated>2010-07-02T14:04:08Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating extra disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
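The resulting trace can be post-processed by pairing each kvm_pio event with the following kvm_set_irq event.  A sketch is shown below; the trace_pipe line format used here is an approximation for illustration, not an exact specification:&lt;br /&gt;

```python
import re

# Sketch: pair kvm_pio (virtqueue notify exit) with the next kvm_set_irq
# (interrupt injection) in ftrace output. The line format below approximates
# trace_pipe output; it is not an exact specification.
LINE_RE = re.compile(r"\s(\d+\.\d+):\s+(kvm_pio|kvm_set_irq):")

def kvm_latencies_us(lines):
    """Return kvm_pio-to-kvm_set_irq latencies in microseconds."""
    latencies = []
    pio_ts = None
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        ts, event = float(m.group(1)), m.group(2)
        if event == "kvm_pio":
            pio_ts = ts
        elif pio_ts is not None:  # kvm_set_irq following a kvm_pio
            latencies.append(round((ts - pio_ts) * 1e6, 1))
            pio_ts = None
    return latencies

trace = [
    " qemu-kvm-1000  [001]  5000.000100: kvm_pio: pio_write at 0xc090",
    " qemu-kvm-1000  [001]  5000.000260: kvm_set_irq: gsi 26 level 1",
]
print(kvm_latencies_us(trace))  # [160.0]
```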
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because total guest latency is consistent with the host latency measurements.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  A comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request was 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
&lt;br /&gt;
Latency numbers can be calculated by recording timestamps along the I/O code path.  The trace events patches, which add static trace points to QEMU, are a good mechanism for this sort of instrumentation.&lt;br /&gt;
&lt;br /&gt;
The following trace events were added to QEMU:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads &lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit_done&lt;br /&gt;
|Linux AIO request has been issued to the kernel&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_entry&lt;br /&gt;
|Iothread main loop iteration start&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_exit&lt;br /&gt;
|Iothread main loop iteration finish&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_pre_select&lt;br /&gt;
|Iothread about to block in the select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_post_select&lt;br /&gt;
|Iothread resumed after select(2) system call&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_iohandlers_done&lt;br /&gt;
|Iothread callbacks for select(2) file descriptors finished&lt;br /&gt;
|-&lt;br /&gt;
|main_loop_wait_timers_done&lt;br /&gt;
|Iothread timer processing done&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level&lt;br /&gt;
|About to raise interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_set_irq_level_done&lt;br /&gt;
|Finished raising interrupt in guest&lt;br /&gt;
|-&lt;br /&gt;
|pre_kvm_run&lt;br /&gt;
|Vcpu about to enter guest&lt;br /&gt;
|-&lt;br /&gt;
|post_kvm_run&lt;br /&gt;
|Vcpu has exited the guest&lt;br /&gt;
|-&lt;br /&gt;
|kvm_run_exit_io_done&lt;br /&gt;
|Vcpu io exit handler finished&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
 All time measurements are in microseconds.&lt;br /&gt;
 &lt;br /&gt;
 * QEMU userspace latency:&lt;br /&gt;
  * paio_submit/laio_submit 8 us latency:&lt;br /&gt;
    * backport memset patch&lt;br /&gt;
    * stash last unused virtio block request instead of alloc and free&lt;br /&gt;
    * avoid vring access overhead with RAM API&lt;br /&gt;
    * virtqueue_pop trace event helps split up this latency&lt;br /&gt;
  * aio=threads&lt;br /&gt;
    * seqread 200.309 us&lt;br /&gt;
    * virtio_queue_notify 45.292&lt;br /&gt;
    * paio_submit 8.023&lt;br /&gt;
    * posix_aio_read 143.724&lt;br /&gt;
    * posix_aio_process_queue 1.965&lt;br /&gt;
    * virtio_blk_rw_complete 0.260&lt;br /&gt;
    * virtio_notify 1.034&lt;br /&gt;
    * total 155.006&lt;br /&gt;
  * aio=native&lt;br /&gt;
    * seqread 193.374 us&lt;br /&gt;
    * virtio_queue_notify 44.464&lt;br /&gt;
    * laio_submit 8.377&lt;br /&gt;
    * qemu_laio_completion_cb 136.241&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.754&lt;br /&gt;
    * virtio_blk_rw_complete 0.294&lt;br /&gt;
    * virtio_notify 1.342&lt;br /&gt;
    * total 148.008&lt;br /&gt;
  * aio=native + memset&lt;br /&gt;
    * seqread 186.468 us&lt;br /&gt;
    * virtio_queue_notify 44.696&lt;br /&gt;
    * virtqueue_pop 2.446&lt;br /&gt;
    * laio_submit 1.711&lt;br /&gt;
    * qemu_laio_completion_cb 134.478&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.737&lt;br /&gt;
    * virtio_blk_rw_complete 0.296&lt;br /&gt;
    * virtio_notify 1.087&lt;br /&gt;
    * total 141.755&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3041</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3041"/>
		<updated>2010-07-02T13:38:49Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
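The seqread C source is not included on this page. As a rough illustration only (a hypothetical Python sketch, not the actual benchmark program), the measurement amounts to timing a loop of block-sized reads on an O_DIRECT file descriptor:&lt;br /&gt;

```python
import mmap
import os
import time

def seqread_mean_latency(path, block_size=4096, num_reads=100,
                         flags=os.O_DIRECT):
    """Mean latency of sequential block_size reads, in seconds."""
    # O_DIRECT requires buffers aligned to the logical block size;
    # an anonymous mmap is page-aligned, which satisfies 4k alignment.
    buf = mmap.mmap(-1, block_size)
    fd = os.open(path, os.O_RDONLY | flags)
    try:
        start = time.monotonic()
        for _ in range(num_reads):
            os.readv(fd, [buf])
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return elapsed / num_reads
```

The flags parameter is there so the sketch can also run without O_DIRECT (flags=0) on filesystems that do not support direct I/O.&lt;br /&gt;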
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from this git branch:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
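Pairing the vp_notify and vring_interrupt timestamps from the trace gives per-request latencies. A hypothetical Python sketch of that pairing (the exact ftrace line layout varies between kernel versions, so the regular expression here is an assumption):&lt;br /&gt;

```python
import re

# Matches the timestamp and traced function in ftrace function-tracer
# output, e.g. "  bench-1234 [000] 5058.123456: vp_notify".
EVENT_RE = re.compile(r"\s(\d+\.\d+): (\w+)")

def notify_to_interrupt_latencies(lines):
    """Yield vp_notify-to-vring_interrupt latencies in microseconds."""
    pending = None
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts, func = float(m.group(1)), m.group(2)
        if func == "vp_notify":
            # Remember when the guest kicked the host.
            pending = ts
        elif func == "vring_interrupt" and pending is not None:
            # The host responded; emit the round-trip latency.
            yield (ts - pending) * 1e6
            pending = None
```

With a single-vcpu guest and one outstanding request at a time, this simple one-in-flight pairing is sufficient.&lt;br /&gt;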
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta time tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
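The subtraction is simple enough to sketch; the numbers below are taken from the cumulative latency table above:&lt;br /&gt;

```python
def layer_deltas(cumulative):
    """Deltas between successive cumulative latency readings."""
    return [outer - inner for outer, inner in zip(cumulative, cumulative[1:])]

# Cumulative latencies in ns from the table above, outermost layer first.
cumulative = [196528, 170829, 163268, 159628, 130235]
names = ["Guest", "Host/guest switching", "Host/QEMU switching", "QEMU"]
total = cumulative[0]
for name, delta in zip(names, layer_deltas(cumulative)):
    print(f"{name}: {delta} ns ({100.0 * delta / total:.2f}%)")
# Host I/O is the innermost cumulative reading itself.
print(f"Host I/O: {cumulative[-1]} ns ({100.0 * cumulative[-1] / total:.2f}%)")
```

Running this reproduces the delta table, e.g. QEMU at 29393 ns (14.96%).&lt;br /&gt;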
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 202095 ns latency (13% of total) is high.  The guest should be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, block layer, and I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because total guest latency is consistent with the host latency measurements.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  A comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
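A pread-based variant of the benchmark loop could look like this hypothetical Python sketch. Note it omits O_DIRECT, since unaligned pread buffers would fail direct I/O, so it only illustrates the syscall substitution:&lt;br /&gt;

```python
import os
import time

def seqread_pread_mean_latency(path, block_size=4096, num_reads=100):
    """Sequential reads issued with pread(2), like QEMU uses pread64()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        for i in range(num_reads):
            # Explicit offset per call instead of advancing the file position.
            os.pread(fd, block_size, i * block_size)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return elapsed / num_reads
```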
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
The time spent in QEMU servicing a read request was 29 us, a 23% overhead compared to a host read request.  This deserves closer study so that the overhead can be reduced.&lt;br /&gt;
&lt;br /&gt;
The benchmark QEMU binary was updated to qemu-kvm.git upstream [Tue Jun 29 13:59:10 2010 +0100] in order to take advantage of the latest optimizations that have gone into qemu-kvm.git, including the virtio-blk memset elimination patch.&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|multiwrite_cb&lt;br /&gt;
|Multiwrite operations have completed&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_earlyfail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_latefail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|Read/write request completion&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_handle_write&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|Asynchronous I/O request submission to worker threads&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|Asynchronous I/O request completion&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|Asynchronous I/O completion events pending&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|Linux AIO completion events are about to be processed&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|Linux AIO request completion&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|Linux AIO request is being issued to the kernel&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
 All time measurements are in microseconds.&lt;br /&gt;
 &lt;br /&gt;
 * QEMU userspace latency:&lt;br /&gt;
  * paio_submit/laio_submit 8 us latency:&lt;br /&gt;
    * backport memset patch&lt;br /&gt;
    * stash last unused virtio block request instead of alloc and free&lt;br /&gt;
    * avoid vring access overhead with RAM API&lt;br /&gt;
    * virtqueue_pop trace event helps split up this latency&lt;br /&gt;
  * aio=threads&lt;br /&gt;
    * seqread 200.309 us&lt;br /&gt;
    * virtio_queue_notify 45.292&lt;br /&gt;
    * paio_submit 8.023&lt;br /&gt;
    * posix_aio_read 143.724&lt;br /&gt;
    * posix_aio_process_queue 1.965&lt;br /&gt;
    * virtio_blk_rw_complete 0.260&lt;br /&gt;
    * virtio_notify 1.034&lt;br /&gt;
    * total 155.006&lt;br /&gt;
  * aio=native&lt;br /&gt;
    * seqread 193.374 us&lt;br /&gt;
    * virtio_queue_notify 44.464&lt;br /&gt;
    * laio_submit 8.377&lt;br /&gt;
    * qemu_laio_completion_cb 136.241&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.754&lt;br /&gt;
    * virtio_blk_rw_complete 0.294&lt;br /&gt;
    * virtio_notify 1.342&lt;br /&gt;
    * total 148.008&lt;br /&gt;
  * aio=native + memset&lt;br /&gt;
    * seqread 186.468 us&lt;br /&gt;
    * virtio_queue_notify 44.696&lt;br /&gt;
    * virtqueue_pop 2.446&lt;br /&gt;
    * laio_submit 1.711&lt;br /&gt;
    * qemu_laio_completion_cb 134.478&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.737&lt;br /&gt;
    * virtio_blk_rw_complete 0.296&lt;br /&gt;
    * virtio_notify 1.087&lt;br /&gt;
    * total 141.755&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3039</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3039"/>
		<updated>2010-06-25T14:05:19Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from this git branch:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The per-layer numbers below are derived from the table above by subtracting successive cumulative latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two successive layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
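The subtraction can be sketched in a few lines of Python (an illustration using the numbers above, not one of the scripts from this page):&lt;br /&gt;

```python
# Cumulative latencies (ns) from the table above, outermost layer first.
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
]
total = cumulative[0][1]

# Each layer's share is the difference between successive readings;
# the percentages match the Delta table above.
for (layer, outer), (_, inner) in zip(cumulative, cumulative[1:]):
    delta = outer - inner
    print("%s: %d ns (%.2f%%)" % (layer, delta, 100.0 * delta / total))
```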
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
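The timing method can be sketched as follows (a simplified Python illustration; the actual seqread benchmark is a C program and opens the device with O_DIRECT and aligned buffers, which is omitted here):&lt;br /&gt;

```python
import os
import time

def mean_read_latency(path, block_size=4096, num_requests=10000):
    """Mean time per sequential read, timed around the whole loop."""
    # The real benchmark uses O_DIRECT to bypass the page cache; that
    # detail (and buffer alignment) is omitted in this sketch.
    fd = os.open(path, os.O_RDONLY)
    start = time.time()
    for _ in range(num_requests):
        os.read(fd, block_size)
    elapsed = time.time() - start
    os.close(fd)
    # One timestamp pair around the loop, divided by the request count.
    return elapsed / num_requests
```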
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|multiwrite_cb&lt;br /&gt;
|Multiwrite operations have completed&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_earlyfail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_latefail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_handle_write&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
 All time measurements in microseconds.&lt;br /&gt;
 &lt;br /&gt;
 * QEMU userspace latency:&lt;br /&gt;
  * paio_submit/laio_submit 8 us latency:&lt;br /&gt;
    * backport memset patch&lt;br /&gt;
    * stash last unused virtio block request instead of alloc and free&lt;br /&gt;
    * avoid vring access overhead with RAM API&lt;br /&gt;
    * virtqueue_pop trace event helps split up this latency&lt;br /&gt;
  * aio=threads&lt;br /&gt;
    * seqread 200.309 us&lt;br /&gt;
    * virtio_queue_notify 45.292&lt;br /&gt;
    * paio_submit 8.023&lt;br /&gt;
    * posix_aio_read 143.724&lt;br /&gt;
    * posix_aio_process_queue 1.965&lt;br /&gt;
    * virtio_blk_rw_complete 0.260&lt;br /&gt;
    * virtio_notify 1.034&lt;br /&gt;
    * total 155.006&lt;br /&gt;
  * aio=native&lt;br /&gt;
    * seqread 193.374 us&lt;br /&gt;
    * virtio_queue_notify 44.464&lt;br /&gt;
    * laio_submit 8.377&lt;br /&gt;
    * qemu_laio_completion_cb 136.241&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.754&lt;br /&gt;
    * virtio_blk_rw_complete 0.294&lt;br /&gt;
    * virtio_notify 1.342&lt;br /&gt;
    * total 148.008&lt;br /&gt;
  * aio=native + memset&lt;br /&gt;
    * seqread 186.468 us&lt;br /&gt;
    * virtio_queue_notify 44.696&lt;br /&gt;
    * virtqueue_pop 2.446&lt;br /&gt;
    * laio_submit 1.711&lt;br /&gt;
    * qemu_laio_completion_cb 134.478&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.737&lt;br /&gt;
    * virtio_blk_rw_complete 0.296&lt;br /&gt;
    * virtio_notify 1.087&lt;br /&gt;
    * total 141.755&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3038</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3038"/>
		<updated>2010-06-25T14:04:32Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
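The qemu_virtio.awk script performs this pairing of notify events; a rough sketch of the idea in Python might look like this (the record format is hypothetical, not the actual script, and it assumes a 1-vcpu guest with one request in flight at a time):&lt;br /&gt;

```python
# Sketch of pairing virtio_queue_notify/virtio_notify trace events to
# compute per-request latency.  The (timestamp_ns, event, vdev) record
# format is hypothetical; real traces come from simpletrace.py output.
def virtio_latencies(records, vdev):
    latencies = []
    pending = None
    for ts, event, dev in records:
        if dev != vdev:
            continue  # ignore other virtio devices in the trace
        if event == "virtio_queue_notify":
            pending = ts  # guest kicked the host
        elif event == "virtio_notify" and pending is not None:
            latencies.append(ts - pending)  # host raised the interrupt
            pending = None
    return latencies
```

With a single request in flight, each notify pairs with the next interrupt for the same vdev, and the mean of the returned list corresponds to the QEMU virtio latency described above.&lt;br /&gt;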
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The per-layer numbers below are derived from the table above by subtracting successive cumulative latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two successive layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
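The subtraction can be sketched in a few lines of Python (an illustration using the numbers above, not one of the scripts from this page):&lt;br /&gt;

```python
# Cumulative latencies (ns) from the table above, outermost layer first.
cumulative = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
]
total = cumulative[0][1]

# Each layer's share is the difference between successive readings;
# the percentages match the Delta table above.
for (layer, outer), (_, inner) in zip(cumulative, cumulative[1:]):
    delta = outer - inner
    print("%s: %d ns (%.2f%%)" % (layer, delta, 100.0 * delta / total))
```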
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
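The timing method can be sketched as follows (a simplified Python illustration; the actual seqread benchmark is a C program and opens the device with O_DIRECT and aligned buffers, which is omitted here):&lt;br /&gt;

```python
import os
import time

def mean_read_latency(path, block_size=4096, num_requests=10000):
    """Mean time per sequential read, timed around the whole loop."""
    # The real benchmark uses O_DIRECT to bypass the page cache; that
    # detail (and buffer alignment) is omitted in this sketch.
    fd = os.open(path, os.O_RDONLY)
    start = time.time()
    for _ in range(num_requests):
        os.read(fd, block_size)
    elapsed = time.time() - start
    os.close(fd)
    # One timestamp pair around the loop, divided by the request count.
    return elapsed / num_requests
```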
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising the interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|multiwrite_cb&lt;br /&gt;
|Multiwrite operations have completed&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_earlyfail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_latefail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_handle_write&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
 * QEMU userspace latency:&lt;br /&gt;
  * paio_submit/laio_submit 8 us latency:&lt;br /&gt;
    * backport memset patch&lt;br /&gt;
    * stash last unused virtio block request instead of alloc and free&lt;br /&gt;
    * avoid vring access overhead with RAM API&lt;br /&gt;
    * virtqueue_pop trace event helps split up this latency&lt;br /&gt;
  * aio=threads&lt;br /&gt;
    * seqread 200.309 us&lt;br /&gt;
    * virtio_queue_notify 45.292&lt;br /&gt;
    * paio_submit 8.023&lt;br /&gt;
    * posix_aio_read 143.724&lt;br /&gt;
    * posix_aio_process_queue 1.965&lt;br /&gt;
    * virtio_blk_rw_complete 0.260&lt;br /&gt;
    * virtio_notify 1.034&lt;br /&gt;
    * total 155.006&lt;br /&gt;
  * aio=native&lt;br /&gt;
    * seqread 193.374 us&lt;br /&gt;
    * virtio_queue_notify 44.464&lt;br /&gt;
    * laio_submit 8.377&lt;br /&gt;
    * qemu_laio_completion_cb 136.241&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.754&lt;br /&gt;
    * virtio_blk_rw_complete 0.294&lt;br /&gt;
    * virtio_notify 1.342&lt;br /&gt;
    * total 148.008&lt;br /&gt;
  * aio=native + memset&lt;br /&gt;
    * seqread 186.468 us&lt;br /&gt;
    * virtio_queue_notify 44.696&lt;br /&gt;
    * virtqueue_pop 2.446&lt;br /&gt;
    * laio_submit 1.711&lt;br /&gt;
    * qemu_laio_completion_cb 134.478&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.737&lt;br /&gt;
    * virtio_blk_rw_complete 0.296&lt;br /&gt;
    * virtio_notify 1.087&lt;br /&gt;
    * total 141.755&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3037</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3037"/>
		<updated>2010-06-25T13:56:33Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
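&lt;br /&gt;
The core of the pairing calculation, matching each &amp;lt;tt&amp;gt;virtio_queue_notify&amp;lt;/tt&amp;gt; with the following &amp;lt;tt&amp;gt;virtio_notify&amp;lt;/tt&amp;gt; for one device and averaging the deltas, can be sketched in Python (the tuple format is an assumption, not the actual &amp;lt;tt&amp;gt;qemu_virtio.awk&amp;lt;/tt&amp;gt; input):&lt;br /&gt;

```python
def virtio_latency_us(events, vdev):
    """Mean virtqueue-notify to interrupt latency for one virtio device.

    events: iterable of (name, timestamp_us, vdev) tuples extracted from
    the pretty-printed simpletrace output.
    """
    deltas, notify_ts = [], None
    for name, ts, dev in events:
        if dev != vdev:
            continue  # ignore other virtio devices in the trace
        if name == "virtio_queue_notify":
            notify_ts = ts
        elif name == "virtio_notify" and notify_ts is not None:
            deltas.append(ts - notify_ts)
            notify_ts = None
    return sum(deltas) / len(deltas)

events = [
    ("virtio_queue_notify", 10.0, "0x2a"),
    ("virtio_notify", 165.0, "0x2a"),
    ("virtio_queue_notify", 300.0, "0x2a"),
    ("virtio_notify", 450.0, "0x2a"),
]
print(virtio_latency_us(events, "0x2a"))
```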
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
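&lt;br /&gt;
A hypothetical stand-in for the statistics step (not the actual &amp;lt;tt&amp;gt;qemu_paio.py&amp;lt;/tt&amp;gt; script), given a list of per-request latencies in nanoseconds:&lt;br /&gt;

```python
def latency_stats(samples_ns):
    """Mean and selected percentiles of per-request latencies."""
    s = sorted(samples_ns)
    def pct(p):
        # nearest-rank style percentile, clamped to the last sample
        return s[min(len(s) - 1, int(p * len(s)))]
    return {
        "mean": sum(s) // len(s),
        "p50": pct(0.50),
        "p95": pct(0.95),
        "max": s[-1],
    }

print(latency_stats([120000, 130000, 125000, 900000]))
```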
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is version 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
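&lt;br /&gt;
The subtraction can be reproduced with a few lines of Python; the trailing zero baseline encodes that &#039;&#039;&#039;Host I/O&#039;&#039;&#039; accounts for everything below &#039;&#039;&#039;QEMU paio&#039;&#039;&#039;:&lt;br /&gt;

```python
# Cumulative latencies from the table above, top of the stack first.
cumulative = [196528, 170829, 163268, 159628, 130235]
layers = ["Guest", "Host/guest switching", "Host/QEMU switching",
          "QEMU", "Host I/O"]

values = cumulative + [0]  # Host I/O accounts for the remainder
total = values[0]
for layer, hi, lo in zip(layers, values, values[1:]):
    delta = hi - lo
    print(f"{layer}: {delta} ns ({100 * delta / total:.2f}%)")
```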
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; alone don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Zooming in on QEMU userspace virtio-blk latency ==&lt;br /&gt;
&lt;br /&gt;
=== Trace events ===&lt;br /&gt;
{|&lt;br /&gt;
!Trace event&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
|virtio_add_queue&lt;br /&gt;
|Device has registered a new virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_queue_notify&lt;br /&gt;
|Guest -&amp;gt; host virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|virtqueue_pop&lt;br /&gt;
|A buffer has been removed from the virtqueue&lt;br /&gt;
|-&lt;br /&gt;
|virtio_notify&lt;br /&gt;
|Host -&amp;gt; guest virtqueue notify&lt;br /&gt;
|-&lt;br /&gt;
|multiwrite_cb&lt;br /&gt;
|Multiwrite operations have completed&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_earlyfail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|bdrv_aio_multiwrite_latefail&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_rw_complete&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|virtio_blk_handle_write&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|paio_submit&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_process_queue&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|posix_aio_read&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_enqueue_completed&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|qemu_laio_completion_cb&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|laio_submit&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Results ===&lt;br /&gt;
&lt;br /&gt;
 * QEMU userspace latency (seqread totals in ns, per-event times in us):&lt;br /&gt;
  * paio_submit/laio_submit 8 us latency:&lt;br /&gt;
    * backport memset patch&lt;br /&gt;
    * stash last unused virtio block request instead of alloc and free&lt;br /&gt;
    * avoid vring access overhead with RAM API&lt;br /&gt;
    * virtqueue_pop trace event helps split up this latency&lt;br /&gt;
  * aio=threads&lt;br /&gt;
    * seqread 200309&lt;br /&gt;
    * virtio_queue_notify 45.292&lt;br /&gt;
    * paio_submit 8.023&lt;br /&gt;
    * posix_aio_read 143.724&lt;br /&gt;
    * posix_aio_process_queue 1.965&lt;br /&gt;
    * virtio_blk_rw_complete 0.260&lt;br /&gt;
    * virtio_notify 1.034&lt;br /&gt;
    * total 155.006&lt;br /&gt;
  * aio=native&lt;br /&gt;
    * seqread 193374&lt;br /&gt;
    * virtio_queue_notify 44.464&lt;br /&gt;
    * laio_submit 8.377&lt;br /&gt;
    * qemu_laio_completion_cb 136.241&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.754&lt;br /&gt;
    * virtio_blk_rw_complete 0.294&lt;br /&gt;
    * virtio_notify 1.342&lt;br /&gt;
    * total 148.008&lt;br /&gt;
  * aio=native + memset&lt;br /&gt;
    * seqread 186468&lt;br /&gt;
    * virtio_queue_notify 44.696&lt;br /&gt;
    * virtqueue_pop 2.446&lt;br /&gt;
    * laio_submit 1.711&lt;br /&gt;
    * qemu_laio_completion_cb 134.478&lt;br /&gt;
    * qemu_laio_enqueue_completed 1.737&lt;br /&gt;
    * virtio_blk_rw_complete 0.296&lt;br /&gt;
    * virtio_notify 1.087&lt;br /&gt;
    * total 141.755&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3029</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3029"/>
		<updated>2010-06-04T14:30:34Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
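&lt;br /&gt;
The original benchmark is a C program; an equivalent sketch in Python shows the idea (the function name and defaults are illustrative, not the actual benchmark source):&lt;br /&gt;

```python
import mmap
import os
import time

def mean_read_latency_ns(path, block_size=4096, count=100000, o_direct=True):
    """Sequential block_size reads; returns the mean time per read in ns.

    O_DIRECT bypasses the page cache so every read reaches the device,
    which is what we want when measuring raw per-request latency.
    """
    flags = os.O_RDONLY
    if o_direct:
        flags |= os.O_DIRECT  # Linux-only; requires an aligned buffer
    fd = os.open(path, flags)
    buf = mmap.mmap(-1, block_size)  # mmap memory is page-aligned
    try:
        done = 0
        start = time.monotonic_ns()
        for i in range(count):
            if os.preadv(fd, [buf], i * block_size) == 0:
                break  # reached end of file/device early
            done += 1
        return (time.monotonic_ns() - start) // done
    finally:
        buf.close()
        os.close(fd)
```

Running this against the virtio disk inside the guest and against the same LVM volume on the host gives the two endpoints of the latency comparison.&lt;br /&gt;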
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that writing the trace file to a tmpfs filesystem avoids generating disk I/O just to store the trace itself.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is version 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should only be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU is building AIO requests for each virtio-blk read command and transforming the results back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch]:&lt;br /&gt;
&lt;br /&gt;
 $ gdb -batch -x futex.gdb -p $(pgrep qemu) # to find futex addresses&lt;br /&gt;
 # echo &#039;uaddr == 0x89b800 || uaddr == 0x89b9e0&#039; &amp;gt;events/syscalls/sys_enter_futex/filter # to trace only those futexes&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_enter_futex/enable&lt;br /&gt;
 # echo 1 &amp;gt;events/syscalls/sys_exit_futex/enable&lt;br /&gt;
 [...run benchmark...]&lt;br /&gt;
 # ./futex.py &amp;lt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; alone don&#039;t show the full picture of the system.  I have copies of the raw trace data, which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3028</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3028"/>
		<updated>2010-06-04T14:20:09Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that writing the trace file to a tmpfs filesystem avoids generating disk I/O just to store the trace itself.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
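&lt;br /&gt;
The statistics step can be sketched in a few lines of Python.  This is an illustrative stand-in, not the actual qemu_paio.py script:&lt;br /&gt;

```python
# Illustrative stand-in for a latency-statistics script such as
# qemu_paio.py: given completion latencies in nanoseconds, report the
# count, mean, minimum, and maximum.
def latency_stats(latencies_ns):
    n = len(latencies_ns)
    return {
        'count': n,
        'mean': sum(latencies_ns) // n,
        'min': min(latencies_ns),
        'max': max(latencies_ns),
    }

samples = [128000, 131000, 129500, 132500]  # made-up example values
print(latency_stats(samples))
```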
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is version 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
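&lt;br /&gt;
The subtraction of successive readings can be reproduced with a short Python sketch using the cumulative values from the table above (the final reading, QEMU paio, is counted whole as Host I/O):&lt;br /&gt;

```python
# Derive per-layer deltas (ns) from the cumulative latency readings.
cumulative = [
    ('Guest benchmark', 196528),
    ('Guest virtio-pci', 170829),
    ('Host kvm.ko', 163268),
    ('QEMU virtio', 159628),
    ('QEMU paio', 130235),
]
total = cumulative[0][1]
# Each delta is the gap between successive layers; the innermost reading
# (QEMU paio) is the Host I/O time itself.
deltas = [(name, outer - inner)
          for (name, outer), (_, inner) in zip(cumulative, cumulative[1:])]
deltas.append(('Host I/O', cumulative[-1][1]))
for name, ns in deltas:
    print('%-20s %6d ns  %5.2f%%' % (name, ns, 100.0 * ns / total))
```

The deltas telescope, so they always sum back to the outermost (guest benchmark) latency.&lt;br /&gt;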
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
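&lt;br /&gt;
The timing methodology can be sketched as follows.  This is a simplified Python stand-in for the C benchmark; the real seqread opens the file with O_DIRECT, which requires aligned buffers and is omitted here:&lt;br /&gt;

```python
import os
import tempfile
import time

def mean_read_latency_ns(path, block_size=4096):
    # Sequential reads in a loop; one timestamp before the loop and one
    # after all requests finish.  Mean latency is total time divided by
    # the number of read() calls, as the seqread benchmark does.
    fd = os.open(path, os.O_RDONLY)
    try:
        count = 0
        start = time.monotonic_ns()
        while os.read(fd, block_size):
            count += 1
        elapsed = time.monotonic_ns() - start
    finally:
        os.close(fd)
    return elapsed // count

# Exercise the sketch on a small temporary file (64 x 4k blocks).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'\0' * 4096 * 64)
lat = mean_read_latency_ns(f.name)
os.remove(f.name)
print(lat)
```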
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used, because the guest latency is consistent with the host latency.  Ideally, guest tracing would use host timestamps so guest and host events could be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch].&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.  I have copies of the raw trace data which can be used to look at the latency distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparing the host benchmark with QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3027</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3027"/>
		<updated>2010-06-04T14:17:44Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU, including virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that writing the trace file to a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is version 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used, because the guest latency is consistent with the host latency.  Ideally, guest tracing would use host timestamps so guest and host events could be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== Host/guest switching =====&lt;br /&gt;
&lt;br /&gt;
===== Host/QEMU switching =====&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time from virtqueue notify until the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall is issued, plus the time from syscall return until an interrupt is raised to notify the guest.  QEMU builds an AIO request for each virtio-blk read command and transforms the result back again before raising an interrupt.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;QEMU iothread mutex contention&#039;&#039;&#039; due to the architecture of qemu-kvm.  In preliminary futex wait profiling on my laptop, I have seen threads blocking on average 20 us when the iothread mutex is contended.  Further work could investigate whether this is the case here and then how to structure QEMU in a way that solves the lock contention.  See &amp;lt;tt&amp;gt;futex.gdb&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;futex.py&amp;lt;/tt&amp;gt; for futex profiling using ftrace in [http://repo.or.cz/w/qemu-kvm/stefanha.git/tree/tracing-dev-0.12.4:/latency_scripts my tracing branch].&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean average latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparing the host benchmark with QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3026</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3026"/>
		<updated>2010-06-04T14:08:50Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU, including virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that writing the trace file to a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
qemu-kvm is version 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection, which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
===== Host/guest switching =====&lt;br /&gt;
&lt;br /&gt;
===== Host/QEMU switching =====&lt;br /&gt;
&lt;br /&gt;
===== QEMU =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; 29393 ns latency (~15% of total) is high.  The &amp;lt;tt&amp;gt;QEMU&amp;lt;/tt&amp;gt; layer accounts for the time between virtqueue notify until issuing the &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscall and return of the syscall until raising an interrupt to notify the guest.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3025</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3025"/>
		<updated>2010-06-04T13:59:32Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
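A minimal Python sketch of such a benchmark (the original is a C program; the function name and the &amp;lt;tt&amp;gt;use_direct&amp;lt;/tt&amp;gt; parameter here are illustrative):&lt;br /&gt;

```python
import mmap
import os
import time

def seqread_mean_latency_ns(path, block_size=4096, use_direct=True):
    """Sequentially read path in block_size chunks; return mean ns per read()."""
    flags = os.O_RDONLY
    if use_direct:
        flags |= os.O_DIRECT  # Linux-only: bypass the page cache
    # O_DIRECT requires an aligned buffer; anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, block_size)
    with os.fdopen(os.open(path, flags), "rb", buffering=0) as f:
        nreads = 0
        start = time.monotonic_ns()
        while f.readinto(buf):
            nreads += 1
        elapsed_ns = time.monotonic_ns() - start
    buf.close()
    # Mean latency: total wall-clock time divided by the number of reads.
    return elapsed_ns / nreads if nreads else 0.0
```

Pass &amp;lt;tt&amp;gt;use_direct=False&amp;lt;/tt&amp;gt; on filesystems that do not support &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt;, such as tmpfs.&lt;br /&gt;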
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
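The two events can then be paired offline to get per-request kvm latency.  A hedged sketch (the exact &amp;lt;tt&amp;gt;trace_pipe&amp;lt;/tt&amp;gt; line format varies between kernel versions, so the regular expression is an assumption to adapt):&lt;br /&gt;

```python
import re

# Assumed ftrace line shape: "  task-pid [cpu]  1234.567890: event_name: ..."
LINE_RE = re.compile(r"\s(\d+)\.(\d+): (\w+):")

def kvm_latencies_ns(lines):
    """Pair each kvm_pio (notify exit) with the next kvm_set_irq; return ns deltas."""
    latencies = []
    pio_ts = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        sec, frac, event = m.groups()
        # Avoid float rounding: seconds plus zero-padded fractional nanoseconds.
        ts_ns = int(sec) * 1_000_000_000 + int(frac.ljust(9, "0"))
        if event == "kvm_pio":
            pio_ts = ts_ns
        elif event == "kvm_set_irq" and pio_ts is not None:
            latencies.append(ts_ns - pio_ts)
            pio_ts = None
    return latencies
```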
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
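The pairing done by &amp;lt;tt&amp;gt;qemu_virtio.awk&amp;lt;/tt&amp;gt; can be sketched in Python as follows (the event tuple layout is an assumption about the pretty-printed trace, not the actual input format of the script):&lt;br /&gt;

```python
def virtio_mean_latency_ns(events, vdev):
    """events: iterable of (timestamp_ns, event_name, vdev_ptr) tuples.
    Pairs each virtio_queue_notify with the following virtio_notify
    for one vdev and returns the mean latency in nanoseconds."""
    total = 0
    count = 0
    notify_ts = None
    for ts, name, dev in events:
        if dev != vdev:
            continue  # skip other virtio devices, e.g. the root disk
        if name == "virtio_queue_notify":
            notify_ts = ts
        elif name == "virtio_notify" and notify_ts is not None:
            total += ts - notify_ts
            count += 1
            notify_ts = None
    return total / count if count else 0.0
```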
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
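A sketch of the statistics step (illustrative only; &amp;lt;tt&amp;gt;qemu_paio.py&amp;lt;/tt&amp;gt; may compute more than this):&lt;br /&gt;

```python
def paio_read_latency_stats(requests):
    """requests: iterable of (type, latency_ns) pairs extracted from
    posix_aio_process_queue trace entries.  Keeps only reads (type 0x1)."""
    reads = [lat for typ, lat in requests if typ == 0x1]
    if not reads:
        return None
    return {
        "count": len(reads),
        "mean_ns": sum(reads) / len(reads),
        "min_ns": min(reads),
        "max_ns": max(reads),
    }
```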
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
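The subtraction can be reproduced with a short script (a sketch; the row labels are illustrative).  Everything below the deepest instrumented layer, QEMU paio, is attributed to host I/O:&lt;br /&gt;

```python
def layer_deltas(cumulative):
    """cumulative: (layer, ns) pairs ordered from the guest benchmark at the
    top of the stack down to QEMU paio.  Returns (label, delta_ns, pct) rows."""
    total = cumulative[0][1]
    rows = []
    for (upper, ns), (lower, next_ns) in zip(cumulative, cumulative[1:]):
        delta = ns - next_ns  # gap between adjacent layers
        rows.append((f"{upper} to {lower}", delta, 100.0 * delta / total))
    # The remaining time under the deepest instrumented layer is host I/O.
    bottom_ns = cumulative[-1][1]
    rows.append(("Host I/O", bottom_ns, 100.0 * bottom_ns / total))
    return rows
```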
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
Possible explanations:&lt;br /&gt;
* &#039;&#039;&#039;Inefficiency in the guest kernel I/O path&#039;&#039;&#039; as suggested by past oprofile data.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Expensive operations&#039;&#039;&#039; performed by the guest, besides the pio write vmexit and interrupt injection, which are accounted for by &amp;lt;tt&amp;gt;Host/guest switching&amp;lt;/tt&amp;gt; and not included in this figure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest latency stacks up with host latency.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3024</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3024"/>
		<updated>2010-06-04T13:40:40Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores, 8 GB RAM, and 12 LVM-striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, which is what single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show for sequential read throughput.  The results I collected only measure 4k sequential reads; the picture may vary with writes or different block sizes.&lt;br /&gt;
&lt;br /&gt;
===== Guest =====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest&amp;lt;/tt&amp;gt; 25699 ns latency (13% of total) is high.  The guest should just be filling in virtio-blk read commands and talking to the virtio-blk PCI device; there isn&#039;t much interesting work going on inside the guest.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark inside the guest is doing sequential &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls in a loop.  A timestamp is taken before the loop and after all requests have finished; the mean latency is calculated by dividing this total time by the number of &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; calls.&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;tt&amp;gt;Guest virtio-pci&amp;lt;/tt&amp;gt; tracepoints provide timestamps when the guest performs the virtqueue notify via a pio write and when the interrupt handler is executed to service the response from the host.&lt;br /&gt;
&lt;br /&gt;
Between the &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; userspace program and &amp;lt;tt&amp;gt;virtio-pci&amp;lt;/tt&amp;gt; are several kernel layers, including the VFS, the block layer, and the I/O scheduler.  Previous guest oprofile data from Khoa Huynh showed &amp;lt;tt&amp;gt;__make_request&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;get_request&amp;lt;/tt&amp;gt; taking significant amounts of CPU time.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mean latencies&#039;&#039;&#039; don&#039;t show the full picture of the system.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3023</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3023"/>
		<updated>2010-06-04T12:46:05Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
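&lt;br /&gt;
As a rough illustration of the final step, here is a hypothetical stand-in for the statistics calculation (the real qemu_paio.py may differ), assuming one latency sample per request has already been extracted in nanoseconds:&lt;br /&gt;

```python
# Hypothetical stand-in for the qemu_paio.py statistics step: summarize
# one latency sample per request, given in nanoseconds.
def latency_stats(samples_ns):
    samples = sorted(samples_ns)
    n = len(samples)
    return {
        "count": n,
        "min": samples[0],
        "max": samples[-1],
        "mean": sum(samples) / float(n),
        "median": samples[n // 2],
    }
```

For example, latency_stats([130235, 128900, 131600]) returns the count, mean, and range of those samples.&lt;br /&gt;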
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has 2x4 cores and 8 GB RAM, with benchmark storage striped across 12 FC LUNs using LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm version is 0.12.4, with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
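&lt;br /&gt;
The subtraction can be checked mechanically; this short Python sketch recomputes the delta table from the cumulative readings above:&lt;br /&gt;

```python
# Recompute the per-layer deltas from the cumulative latency table
# above.  Each delta is the difference between successive cumulative
# readings; the final Host I/O layer is the QEMU paio latency itself.
cumulative = [196528, 170829, 163268, 159628, 130235]
layer_names = ["Guest", "Host/guest switching", "Host/QEMU switching",
               "QEMU", "Host I/O"]

total = cumulative[0]
deltas = [a - b for a, b in zip(cumulative, cumulative[1:])]
deltas.append(cumulative[-1])

for name, ns in zip(layer_names, deltas):
    print("%-22s %6d ns  %5.2f%%" % (name, ns, 100.0 * ns / total))
```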
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  I think this is why the latency numbers are in the microsecond range, not the usual millisecond seek time expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
The results give a 33% virtualization overhead.  I expected the overhead to be higher, around 50%, since single-process &amp;lt;tt&amp;gt;dd bs=8k iflag=direct&amp;lt;/tt&amp;gt; benchmarks show guest sequential read throughput at roughly half the host figure.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3022</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3022"/>
		<updated>2010-06-04T12:34:32Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches:&lt;br /&gt;
&lt;br /&gt;
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4&lt;br /&gt;
&lt;br /&gt;
This particular [http://repo.or.cz/w/qemu-kvm/stefanha.git/commit/deaa69d19c14b0ce902c9f5f10455f9cbefeff5b commit message] explains how to use the simple trace backend for latency tracing.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  This means the latency numbers are in the microsecond range, not the millisecond seek times usually expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3021</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3021"/>
		<updated>2010-06-04T10:48:21Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Cumulative latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how much time is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  This means the latency numbers are in the microsecond range, not the millisecond seek times usually expected from disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  Comparison between the host benchmark and QEMU paio would be more correct when using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; in the benchmark itself.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3020</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3020"/>
		<updated>2010-06-04T10:46:52Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids causing disk I/O in order to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm binary is version 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1-vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Latency (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
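The deltas and percentages in this table can be reproduced from the raw latency readings above by subtracting successive values, e.g.:

```shell
# Raw per-layer latency readings (ns) from the table above.
guest=196528; vpci=170829; kvm=163268; qvirtio=159628; paio=130235
echo "Guest:                $((guest - vpci)) ns"     # 25699
echo "Host/guest switching: $((vpci - kvm)) ns"       # 7561
echo "Host/QEMU switching:  $((kvm - qvirtio)) ns"    # 3640
echo "QEMU:                 $((qvirtio - paio)) ns"   # 29393
echo "Host I/O:             $paio ns"                 # 130235
# Percentage of total guest latency spent in the guest layer.
awk -v d=$((guest - vpci)) -v t=$guest 'BEGIN { printf "Guest share: %.2f%%\n", 100 * d / t }'   # 13.08%
```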
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  This means the latency numbers are in the microsecond range, not the usual millisecond seek time expected on disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3019</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3019"/>
		<updated>2010-06-04T10:45:24Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
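The seqread program itself is not reproduced here; as a rough stand-in, dd can issue sequential 4k O_DIRECT reads (it reports throughput rather than per-request latency, and the scratch file below stands in for the real benchmark disk):

```shell
# Create a 4 MB scratch file, then read it sequentially in 4k chunks.
# iflag=direct bypasses the page cache as the benchmark does; fall back
# to buffered reads if the filesystem lacks O_DIRECT support.
dd if=/dev/zero of=scratch.bin bs=4k count=1024 status=none
dd if=scratch.bin of=/dev/null bs=4k iflag=direct ||
dd if=scratch.bin of=/dev/null bs=4k
```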
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to trace only events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt number.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with the &#039;simple&#039; trace backend and enable the virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find the vdev pointer for the correct virtio-blk device in the trace (this should be easy because most requests go to it).&lt;br /&gt;
* Run qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  It should be similar to the latency seen when running the benchmark directly on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with the &#039;simple&#039; trace backend and enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Keep only reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and strip VM boot/shutdown activity from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm binary is version 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1-vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Latency (ns)&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|25699&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|7561&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|3640&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|29393&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
The following numbers for the layers of the stack are derived from the previous numbers by subtracting successive latency readings:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  This means the latency numbers are in the microsecond range, not the usual millisecond seek time expected on disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3018</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3018"/>
		<updated>2010-06-04T10:30:06Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to trace only events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt number.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with the &#039;simple&#039; trace backend and enable the virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find the vdev pointer for the correct virtio-blk device in the trace (this should be easy because most requests go to it).&lt;br /&gt;
* Run qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  It should be similar to the latency seen when running the benchmark directly on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with the &#039;simple&#039; trace backend and enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Keep only reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and strip VM boot/shutdown activity from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm binary is version 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1-vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Latency (ns)&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|25699&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|7561&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|3640&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|29393&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Delta (%)&lt;br /&gt;
|-&lt;br /&gt;
|Guest&lt;br /&gt;
|25699&lt;br /&gt;
|13.08%&lt;br /&gt;
|-&lt;br /&gt;
|Host/guest switching&lt;br /&gt;
|7561&lt;br /&gt;
|3.85%&lt;br /&gt;
|-&lt;br /&gt;
|Host/QEMU switching&lt;br /&gt;
|3640&lt;br /&gt;
|1.85%&lt;br /&gt;
|-&lt;br /&gt;
|QEMU&lt;br /&gt;
|29393&lt;br /&gt;
|14.96%&lt;br /&gt;
|-&lt;br /&gt;
|Host I/O&lt;br /&gt;
|130235&lt;br /&gt;
|66.27%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  This means the latency numbers are in the microsecond range, not the usual millisecond seek time expected on disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing could be performed using host timestamps so guest and host timestamps can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3017</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3017"/>
		<updated>2010-06-04T09:23:13Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
The benchmark I use is a simple C program that performs sequential 4k reads on an &amp;lt;tt&amp;gt;O_DIRECT&amp;lt;/tt&amp;gt; file descriptor, bypassing the page cache.  The aim is to observe the raw per-request latency when accessing the disk.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of this writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file on a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Event tracing can instrument kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to trace only events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt number.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with the &#039;simple&#039; trace backend and enable the virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find the vdev pointer for the correct virtio-blk device in the trace (this should be easy because most requests go to it).&lt;br /&gt;
* Run qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  It should be similar to the latency seen when running the benchmark directly on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with the &#039;simple&#039; trace backend and enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Keep only reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and strip VM boot/shutdown activity from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host has two 4-core CPUs, 8 GB RAM, and 12 FC LUNs striped with LVM.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm binary is version 0.12.4 with patches applied as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1-vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark results when run on the host against when run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Latency (ns)&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|25699&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|7561&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|3640&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|29393&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. between &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
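&lt;br /&gt;
The delta arithmetic above can be sketched in a few lines of Python (a minimal illustration; the layer names and cumulative latencies are taken directly from the table):&lt;br /&gt;

```python
# Cumulative latencies (ns) from the table, outermost layer first.
layers = [
    ("Guest benchmark", 196528),
    ("Guest virtio-pci", 170829),
    ("Host kvm.ko", 163268),
    ("QEMU virtio", 159628),
    ("QEMU paio", 130235),
    ("Host benchmark", 128862),
]

# The delta for a layer is its latency subtracted from the latency of
# the layer directly above it in the stack.
deltas = {}
for (name_a, lat_a), (name_b, lat_b) in zip(layers, layers[1:]):
    deltas[name_b] = lat_a - lat_b

print(deltas)
```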
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
The sequential read case is optimized by the presence of a disk read cache.  This means the latency numbers are in the microsecond range, not the usual millisecond seek time expected on disks.  However, read caching is not an issue for measuring the latency overhead imposed by virtualization since the cache is active for both host and guest measurements.&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing would use host timestamps so that guest and host events can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3016</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3016"/>
		<updated>2010-06-04T09:03:06Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
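&lt;br /&gt;
The benchmark itself is not listed on this page; a minimal hypothetical stand-in with the same shape (sequential 4k read() calls, reporting the mean nanoseconds per operation; the path and iteration count are placeholders) might look like:&lt;br /&gt;

```python
import os
import time

# Hypothetical stand-in for the single-threaded sequential read
# benchmark described above: 4k read() calls in a loop, reporting the
# mean time per operation in nanoseconds.  Path and iteration count
# are placeholders, not values from the measurements on this page.
def seqread(path, iterations=100000, block_size=4096):
    fd = os.open(path, os.O_RDONLY)
    start = time.monotonic_ns()
    for _ in range(iterations):
        if not os.read(fd, block_size):      # wrap around at EOF
            os.lseek(fd, 0, os.SEEK_SET)
    elapsed_ns = time.monotonic_ns() - start
    os.close(fd)
    return elapsed_ns / iterations           # mean ns per read()
```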
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
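&lt;br /&gt;
Once the trace has been collected, pairing the two functions gives one virtio-pci latency per request.  A schematic post-processing step (parsing the raw ftrace text is omitted; assume (timestamp, function) tuples have already been extracted, and the sample values below are illustrative):&lt;br /&gt;

```python
# Pair each vp_notify timestamp with the next vring_interrupt
# timestamp to get one virtio-pci latency per request.  Input is a
# time-ordered list of (timestamp, function_name) tuples extracted
# from the ftrace output above.
def virtio_pci_latencies(events):
    latencies = []
    pending = None
    for ts, fn in events:
        if fn == "vp_notify":
            pending = ts
        elif fn == "vring_interrupt" and pending is not None:
            latencies.append(ts - pending)
            pending = None
    return latencies

# Illustrative sample: timestamps in seconds, as ftrace reports them.
sample = [(0.000100, "vp_notify"), (0.000271, "vring_interrupt")]
print(virtio_pci_latencies(sample))
```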
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Trace events can measure the kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
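&lt;br /&gt;
qemu_paio.py is not reproduced on this page; a minimal hypothetical equivalent of its statistics step, given per-request latencies already extracted from the pretty-printed trace (the sample values below are illustrative), could be:&lt;br /&gt;

```python
# Hypothetical stand-in for the statistics step of qemu_paio.py:
# summarise a list of per-request latencies in nanoseconds.
def latency_stats(latencies_ns):
    ordered = sorted(latencies_ns)
    n = len(ordered)
    return {
        "mean": sum(ordered) / n,
        "median": ordered[n // 2],
        "min": ordered[0],
        "max": ordered[-1],
    }

# Illustrative values only, not measurements from this page.
print(latency_stats([130235, 128900, 131500]))
```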
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram compares the benchmark run on the host against the same benchmark run inside the guest:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-comparison.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency-breakdown.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagrams:&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Latency (ns)&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|25699&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|7561&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|3640&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|29393&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing would use host timestamps so that guest and host events can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=File:Virtio-blk-latency-comparison.jpg&amp;diff=3015</id>
		<title>File:Virtio-blk-latency-comparison.jpg</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=File:Virtio-blk-latency-comparison.jpg&amp;diff=3015"/>
		<updated>2010-06-04T09:01:52Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: Virtio block latency native-virtualized comparison&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Virtio block latency native-virtualized comparison&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=File:Virtio-blk-latency-breakdown.jpg&amp;diff=3014</id>
		<title>File:Virtio-blk-latency-breakdown.jpg</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=File:Virtio-blk-latency-breakdown.jpg&amp;diff=3014"/>
		<updated>2010-06-04T09:01:02Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: Virtio block latency breakdown&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Virtio block latency breakdown&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
	<entry>
		<id>https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3013</id>
		<title>Virtio/Block/Latency</title>
		<link rel="alternate" type="text/html" href="https://linux-kvm.org/index.php?title=Virtio/Block/Latency&amp;diff=3013"/>
		<updated>2010-06-04T08:47:23Z</updated>

		<summary type="html">&lt;p&gt;Stefanha: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how virtio-blk latency can be measured.  The aim is to build a picture of the latency at different layers of the virtualization stack for virtio-blk.&lt;br /&gt;
&lt;br /&gt;
== Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Single-threaded read or write benchmarks are suitable for measuring virtio-blk latency.  The guest should have 1 vcpu only, which simplifies the setup and analysis.&lt;br /&gt;
&lt;br /&gt;
== Tools ==&lt;br /&gt;
&lt;br /&gt;
Linux kernel tracing (ftrace and trace events) can instrument host and guest kernels.  This includes finding system call and device driver latencies.&lt;br /&gt;
&lt;br /&gt;
Trace events in QEMU can instrument components inside QEMU.  This includes virtio hardware emulation and AIO.  Trace events are not upstream as of writing but can be built from git branches.&lt;br /&gt;
&lt;br /&gt;
== Instrumenting the stack ==&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The single-threaded read/write benchmark prints the mean time per operation at the end.  This number is the total latency including guest, host, and QEMU.  All latency numbers from layers further down the stack should be smaller than the guest number.&lt;br /&gt;
&lt;br /&gt;
==== Guest virtio-pci ====&lt;br /&gt;
&lt;br /&gt;
The virtio-pci latency is the time from the virtqueue notify pio write until the vring interrupt.  The guest performs the notify pio write in virtio-pci code.  The vring interrupt comes from the PCI device in the form of a legacy interrupt or a message-signaled interrupt.&lt;br /&gt;
&lt;br /&gt;
Ftrace can instrument virtio-pci inside the guest:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;vp_notify vring_interrupt&#039; &amp;gt;set_ftrace_filter&lt;br /&gt;
 echo function &amp;gt;current_tracer&lt;br /&gt;
 cat trace_pipe &amp;gt;/path/to/tmpfs/trace&lt;br /&gt;
&lt;br /&gt;
Note that putting the trace file in a tmpfs filesystem avoids generating disk I/O just to store the trace.&lt;br /&gt;
&lt;br /&gt;
==== Host kvm ====&lt;br /&gt;
&lt;br /&gt;
The kvm latency is the time from the virtqueue notify pio exit until the interrupt is set inside the guest.  This number does not include vmexit/entry time.&lt;br /&gt;
&lt;br /&gt;
Trace events can measure the kvm latency on the host:&lt;br /&gt;
 cd /sys/kernel/debug/tracing&lt;br /&gt;
 echo &#039;port == 0xc090&#039; &amp;gt;events/kvm/kvm_pio/filter&lt;br /&gt;
 echo &#039;gsi == 26&#039; &amp;gt;events/kvm/kvm_set_irq/filter&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_pio/enable&lt;br /&gt;
 echo 1 &amp;gt;events/kvm/kvm_set_irq/enable&lt;br /&gt;
 cat trace_pipe &amp;gt;/tmp/trace&lt;br /&gt;
&lt;br /&gt;
Note how &amp;lt;tt&amp;gt;kvm_pio&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;kvm_set_irq&amp;lt;/tt&amp;gt; can be filtered to only trace events for the relevant virtio-blk device.  Use &amp;lt;tt&amp;gt;lspci -vv -nn&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;cat /proc/interrupts&amp;lt;/tt&amp;gt; inside the guest to find the pio address and interrupt.&lt;br /&gt;
&lt;br /&gt;
==== QEMU virtio ====&lt;br /&gt;
&lt;br /&gt;
The virtio latency inside QEMU is the time from virtqueue notify until the interrupt is raised.  This accounts for time spent in QEMU servicing I/O.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable virtio_queue_notify() and virtio_notify() trace events.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Find vdev pointer for correct virtio-blk device in trace (should be easy because most requests will go to it).&lt;br /&gt;
* Use qemu_virtio.awk only on trace entries for the correct vdev.&lt;br /&gt;
&lt;br /&gt;
==== QEMU paio ====&lt;br /&gt;
&lt;br /&gt;
The paio latency is the time spent performing pread()/pwrite() syscalls.  This should be similar to the latency seen when running the benchmark on the host.&lt;br /&gt;
&lt;br /&gt;
* Run with &#039;simple&#039; trace backend, enable the posix_aio_process_queue() trace event.&lt;br /&gt;
* Use ./simpletrace.py trace-events /path/to/trace to pretty-print the binary trace.&lt;br /&gt;
* Only keep reads (&amp;lt;tt&amp;gt;type=0x1&amp;lt;/tt&amp;gt; requests) and remove vm boot/shutdown from the trace file by looking at timestamps.&lt;br /&gt;
* Use qemu_paio.py to calculate the latency statistics.&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
==== Host ====&lt;br /&gt;
&lt;br /&gt;
The host is 2x4-cores, 8 GB RAM, with 12 LVM striped FC LUNs.  Read and write caches are enabled on the disks.&lt;br /&gt;
&lt;br /&gt;
The host kernel is kvm.git 37dec075a7854f0f550540bf3b9bbeef37c11e2a from Sat May 22 16:13:55 2010 +0300.&lt;br /&gt;
&lt;br /&gt;
The qemu-kvm is 0.12.4 with patches as necessary for instrumentation.&lt;br /&gt;
&lt;br /&gt;
==== Guest ====&lt;br /&gt;
&lt;br /&gt;
The guest is a 1 vcpu, x2apic, 4 GB RAM virtual machine running a 2.6.32-based distro kernel.  The root disk image is raw and the benchmark storage is an LVM volume passed through as a virtio disk with cache=none.&lt;br /&gt;
&lt;br /&gt;
==== Performance data ====&lt;br /&gt;
&lt;br /&gt;
The following diagram shows the time spent in the different layers of the virtualization stack:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
[[Image:Virtio-blk-latency.jpg]]&lt;br /&gt;
&amp;lt;br style=&amp;quot;clear: both&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the raw data used to plot the diagram:&lt;br /&gt;
{|&lt;br /&gt;
!Layer&lt;br /&gt;
!Latency (ns)&lt;br /&gt;
!Delta (ns)&lt;br /&gt;
!Guest benchmark control (ns)&lt;br /&gt;
|-&lt;br /&gt;
|Guest benchmark&lt;br /&gt;
|196528&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|Guest virtio-pci&lt;br /&gt;
|170829&lt;br /&gt;
|25699&lt;br /&gt;
|202095&lt;br /&gt;
|-&lt;br /&gt;
|Host kvm.ko&lt;br /&gt;
|163268&lt;br /&gt;
|7561&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|QEMU virtio&lt;br /&gt;
|159628&lt;br /&gt;
|3640&lt;br /&gt;
|205165&lt;br /&gt;
|-&lt;br /&gt;
|QEMU paio&lt;br /&gt;
|130235&lt;br /&gt;
|29393&lt;br /&gt;
|202777&lt;br /&gt;
|-&lt;br /&gt;
|Host benchmark&lt;br /&gt;
|128862&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Delta (ns)&#039;&#039;&#039; column is the time between two adjacent layers, e.g. &#039;&#039;&#039;Guest benchmark&#039;&#039;&#039; and &#039;&#039;&#039;Guest virtio-pci&#039;&#039;&#039;.  The delta tells us how long is spent in each layer of the virtualization stack.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Guest benchmark control (ns)&#039;&#039;&#039; column is the latency reported by the guest benchmark for that run.  It is useful for checking that overall latency has remained relatively similar across benchmarking runs.&lt;br /&gt;
&lt;br /&gt;
==== Analysis ====&lt;br /&gt;
&lt;br /&gt;
==== Known issues ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Timing inside the guest&#039;&#039;&#039; can be inaccurate due to the virtualization architecture.  I believe this issue is not too severe on the kernels and qemu binaries used because the guest numbers are in the right ballpark.  Ideally, guest tracing would use host timestamps so that guest and host events can be compared accurately.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Choice of I/O syscalls&#039;&#039;&#039; may result in different performance.  The &amp;lt;tt&amp;gt;seqread&amp;lt;/tt&amp;gt; benchmark uses 4k &amp;lt;tt&amp;gt;read()&amp;lt;/tt&amp;gt; syscalls while the qemu binary services these I/O requests using &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt; syscalls.  The comparison between the host benchmark and QEMU paio would be more accurate if the benchmark itself used &amp;lt;tt&amp;gt;pread64()&amp;lt;/tt&amp;gt;.&lt;/div&gt;</summary>
		<author><name>Stefanha</name></author>
	</entry>
</feed>