VhostNet
vhost-net: a kernel-level virtio-net server
What is vhost-net
vhost is a kernel-level backend for virtio. The main motivation for vhost is to reduce virtualization overhead for virtio by removing system calls on the data path, without requiring guest changes. For virtio-net, this removes up to 4 system calls per packet: VM exit for the kick, reentry for the kick, iothread wakeup for the packet, and interrupt injection for the packet.
vhost is as minimal as possible. It relies on userspace for all setup work.
Status
vhost is fully functional, and it already shows improvement over userspace virtio.
How to use
Download, build and install the kernel and userspace from:
kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
userspace:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git vhost
Usage instructions:
vhost currently requires MSI-X support in guest virtio. This means the guest kernel version should be >= 2.6.31.
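The 2.6.31 requirement above can be checked from inside the guest before enabling vhost. The following is a minimal sketch, not part of the vhost tooling: version_ge is a hypothetical helper name, and it compares only the first three numeric fields of the version string, ignoring suffixes such as -rc2.

```shell
# version_ge A B: succeed if kernel version A is >= B.
# Only the first three dot/dash-separated fields are compared numerically.
version_ge() {
    printf '%s %s\n' "$1" "$2" | awk '{
        split($1, a, "[.-]"); split($2, b, "[.-]")
        for (i = 1; i <= 3; i++) {
            x = a[i] + 0; y = b[i] + 0
            if (x > y) exit 0
            if (x < y) exit 1
        }
        exit 0
    }'
}

# Run inside the guest.
if version_ge "$(uname -r)" 2.6.31; then
    echo "guest kernel is new enough for vhost (MSI-X in virtio)"
else
    echo "guest kernel is older than 2.6.31; vhost will not work" >&2
fi
```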
To enable vhost, simply add the "vhost=on" option to the -netdev options. Example with tap backend:
    qemu-system-x86_64 -m 1G disk-c.qcow2 \
        -net nic,model=virtio,netdev=foo \
        -netdev type=tap,id=foo,ifname=msttap0,script=/home/mst/ifup,downscript=no,vhost=on
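Before launching qemu with vhost enabled, it can help to confirm that the host kernel exposes the vhost-net device. The sketch below assumes the device node is /dev/vhost-net and the module is named vhost_net, as in mainline kernels. Neither name is given in the instructions above, so adjust them for your build. vhost_dev_ok is a hypothetical helper name.

```shell
# vhost_dev_ok [PATH]: succeed if PATH (default: /dev/vhost-net, an assumption)
# is a character device that the current user can read and write.
vhost_dev_ok() {
    dev=${1:-/dev/vhost-net}
    [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}

# Run on the host.
if vhost_dev_ok; then
    echo "vhost-net device is available"
else
    # Module name is an assumption; it may also be built into the kernel.
    echo "vhost-net device not usable; try: modprobe vhost_net" >&2
fi
```

If the check fails for an unprivileged user while the device node exists, the permissions on the device node are the likely cause.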
Older (demo) version usage:
Example with tap backend:
    qemu-system-x86_64 -m 1G disk-c.qcow2 \
        -net tap,ifname=msttap0,script=/home/mst/ifup,downscript=no \
        -net nic,model=virtio,vhost
Example with raw socket backend:
    ifconfig eth3 promisc
    qemu-system-x86_64 -m 1G disk-c.qcow2 \
        -net raw,ifname=eth3 \
        -net nic,model=virtio,vhost
Note: in raw socket mode, when binding to a physical ethernet device, host to guest communication will only work if your device is connected to a bridge configured to mirror outgoing packets back at the originating link. If you do not know whether this is the case, this most likely means it isn't. Use another box to access the guest, or use tap.
Limitations
- vhost currently requires MSI-X support in guest virtio. This means the guest kernel version should be >= 2.6.31.
- with raw sockets, host-to-guest and guest-to-guest communication on the same host does not always work. Use bridge+tap if you need that.
- driver unloading in the guest and device hot-unplug are broken, because the relevant code in qemu is stubbed out. These still need to be implemented.
Performance
Still tuning performance, especially guest to host.
External-to-system numbers with bridge+tap and a 10GE vxge card. qemu was run with -cpu host,-rdtscp,+x2apic; host and guest kernels were 2.6.33-rc2; MTU 1500.
- netperf TCP_STREAM, default setup, 100 secs run
    native: 81XX Mb/s
    without vhost-net: 72XX Mb/s
    with vhost-net: 78XX Mb/s
- TCP_RR, 100 secs run
    native: 48 usec/Trans
    without vhost-net: 395 usec/Trans
    with vhost-net: 86 usec/Trans
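To put the TCP_RR figures in proportion, the arithmetic can be sketched as below. This only restates the three latencies quoted above; rr_summary is a hypothetical helper name, and nothing here is a new measurement.

```shell
# rr_summary NATIVE USERSPACE VHOST: compare TCP_RR latencies (usec/transaction).
rr_summary() {
    awk -v n="$1" -v u="$2" -v v="$3" 'BEGIN {
        printf "latency reduction with vhost-net: %.1fx\n", u / v
        printf "overhead over native: %d usec without vhost-net, %d usec with\n", u - n, v - n
    }'
}

# Figures from the TCP_RR run above.
rr_summary 48 395 86
# latency reduction with vhost-net: 4.6x
# overhead over native: 347 usec without vhost-net, 38 usec with
```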
Here are some local numbers courtesy of Shirley Ma:
- netperf TCP_STREAM, default setup, 60 secs run
    guest->host increases from 3XXX Mb/s to 5XXX Mb/s
    host->guest increases from 3XXX Mb/s to 4XXX Mb/s
- TCP_RR, 60 secs run
    guest->host trans/s increases from 2XXX/s to 13XXX/s
    host->guest trans/s increases from 2XXX/s to 13XXX/s
TODOs
vhost-net driver projects
- profiling would be very helpful; I have not done any yet.
- merged buffers.
- scalability tuning: figure out the best threading model to use.
qemu projects
- migration support
- level triggered interrupts
- driver unloading/hotplug
- general cleanup and upstreaming
- upstream support for injecting interrupts from kernel, from qemu-kvm.git to qemu.git (this is a vhost dependency, without it vhost can't be upstreamed, or it can, but without real benefit)
virtio projects
- improve small packet/large buffer performance: support "reposting" buffers, pool for indirect buffers
- guest kernel 2.6.31 seems to work well. Under certain workloads, virtio performance has regressed with guest kernels 2.6.32 and up (but is still better than userspace). A patch has been posted: http://www.spinics.net/lists/netdev/msg115292.html
projects involving other kernel components and/or networking stack
- rx mac filtering in tun
- extend raw sockets to support GSO/checksum offloading, and teach vhost to use that capability [one way to do this: virtio net header support]; will allow working with e.g. macvlan
- improve locking: e.g. RX/TX poll should not need a lock
- multicast IGMP snooping in bridge
long term projects
- kvm eventfd support for injecting level interrupts
- multiqueue (involves all of vhost, qemu, virtio, networking stack)
- zero copy tx for tun/raw sockets
Other
- More testing is always good
Short term plans for MST
- get vhost net merged in linux kernel 2.6.33
- address most vhost qemu TODOs
- get vhost support merged in upstream qemu
Short term plans for IBM (Sridhar Samudrala, David Stevens, Shirley Ma)
- Add GSO/checksum offload support to AF_PACKET(raw) sockets.
- Mergeable RX buffers support in vhost-net.
- Defer SKB allocation in virtio_net receive path.