Internals of NDIS driver for VirtIO based network adapter

From KVM
Revision as of 05:30, 28 July 2009 by Yan (talk | contribs)


This document contains implementation notes for the Windows network adapter driver for the VirtIO network device.

Abbreviations in the document

  • LSO, a.k.a. GSO, a.k.a. TSO – large send offload, generic segmentation offload, TCP segmentation offload – the same thing in the context of the Windows driver.
  • OOB – out-of-band packet data; a set of structures associated with a packet
  • SG – scatter-gather
  • MTU – maximum transmission unit
  • MSS – maximum segment size
  • CS – checksum
  • MDL – memory descriptor list
  • WL – waiting list
  • DUT – device under test


NDIS driver features

Basic networking operations

The NetKVM driver supports non-serialized sending and receiving of data. The driver handles multiple concurrent send operations itself and does not require NDIS to serialize them (a Logo requirement). Scatter-gather mode is the default for send operations; it can be disabled in the configuration to force copy mode, which copies data into preallocated buffers. On receive, the driver can indicate up to NDIS as many packets as the VirtIO device can receive, without waiting for NDIS to free receive buffers.

The default MTU reported by the driver to NDIS is 1514 bytes: without LSO, NDIS sends packets no bigger than the MTU, and the driver cannot indicate reception of packets bigger than the MTU. The MTU can be changed through the device properties in Device Manager.

For the bidirectional data channel to and from the device, the driver initializes two VirtIO queues (transmit and receive). The VirtIO queues contain only memory blocks for SG entries (physical address + size); the data buffers that the physical addresses point to are not managed by VirtIO. The buffers required for VirtIO headers and, when necessary, network data, as well as the OS-specific objects required for data indication, are allocated by the driver separately during initialization.

Checksum offload

An NDIS driver can declare the following types of TX checksum offload:

  • IP checksum. NDIS does not initialize the IP header checksum; the device/driver fills it in.
  • TCP checksum. NDIS initializes the TCP header checksum with the “pseudo header checksum” value (the pseudo header does not exist in the packet; the checksum is calculated on a virtual structure), and the device/driver finishes it.
  • Support for IP options and TCP options when calculating the checksum.
  • UDP checksum – exactly as TCP without options.
  • IP and TCP checksum for IPV6.

An NDIS driver may also implement RX checksum offload (the ability to verify the IP/TCP checksum in incoming packets); this feature is not supported in NetKVM. The NetKVM driver has an implementation of all the CS offloads for IPV4, and it works functionally. It is very helpful during development and diagnostics; some of these mechanisms are also used for LSO. However, the implementation does not pass corner cases of the NDIS tests, so it is disabled in the configuration by default.

Large segment offload

In an LSO operation NDIS provides a packet that is usually (not always) bigger than the MTU; the driver/device shall divide it into several smaller fragments before sending them to the other side. Each fragment except the last one must contain MSS bytes of TCP data. During fragmentation the device shall properly populate the IP and TCP headers, including IP options and most TCP options, and fill in all the necessary checksums. The NetKVM driver implements LSO for IPV4 with a maximal packet size of almost 64K before offloading. Note that, at least in the current implementation, LSO requires scatter-gather, as packets bigger than the MTU cannot be placed into preallocated buffers.
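The fragmentation rule above can be illustrated with a small arithmetic sketch. It is not taken from the NetKVM sources; the function names are hypothetical, and it only models the TCP payload split, not header or checksum handling. It assumes payload lengths below 64K, so the rounding addition cannot overflow.

```c
#include <stdint.h>

/* Number of segments produced when payload_len bytes of TCP data are
 * split into pieces of at most mss bytes. Returns 0 when mss is 0. */
static uint32_t lso_segment_count(uint32_t payload_len, uint32_t mss)
{
    if (mss == 0)
        return 0;
    return (payload_len + mss - 1) / mss;   /* round up */
}

/* TCP data length of segment `index` (0-based). Every segment except
 * the last one carries exactly mss bytes; the last carries the rest. */
static uint32_t lso_segment_len(uint32_t payload_len, uint32_t mss,
                                uint32_t index)
{
    uint32_t count = lso_segment_count(payload_len, mss);

    if (index >= count)
        return 0;
    if (index + 1 < count)
        return mss;
    return payload_len - mss * (count - 1);
}
```

For example, 4000 bytes of TCP data with an MSS of 1460 give three segments of 1460, 1460 and 1080 bytes.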

Priority and VLAN tagging

On its lower edge the NetKVM driver supports both the regular Ethernet header of 14 bytes and the Ethernet type II header of 18 bytes. The second one includes a Priority/VLAN tag (32 bits) that carries a 16-bit field of Priority and VLAN. The Ethernet type II header cannot be indicated up to NDIS, and NDIS does not provide this tag in outgoing packets either. Instead, the Priority and VLAN data reside in the OOB block of the NDIS packet(s), and populating the tag in outgoing packets is the responsibility of the driver. The driver is also responsible for removing this tag from incoming packets (if present) and placing the Priority data into the OOB block. In addition, the driver checks the VLAN value and ignores incoming packets addressed to a VLAN different from the one configured for the specific instance of the driver.
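The tag handling described above can be sketched as follows. This is an illustration only, not the driver's code: the function names are hypothetical, and the tag layout (type 0x8100 followed by a 16-bit field with a 3-bit priority in the top bits and a 12-bit VLAN ID in the low bits) is the IEEE 802.1Q layout, which the text above does not spell out.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_BYTES   12       /* destination + source MAC */
#define ETH_HEADER_SIZE  14
#define VLAN_TAG_SIZE    4
#define ETH_TYPE_8021Q   0x8100   /* tag protocol identifier (802.1Q) */

/* Copies frame `src` to `dst`, inserting the tag after the MAC addresses.
 * priority: 0..7, vlan_id: 0..4095. Returns the new length, 0 on error. */
static size_t insert_priority_tag(uint8_t *dst, size_t dst_size,
                                  const uint8_t *src, size_t len,
                                  unsigned priority, unsigned vlan_id)
{
    uint16_t tci;

    if (len < ETH_HEADER_SIZE || dst_size < len + VLAN_TAG_SIZE)
        return 0;
    tci = (uint16_t)(((priority & 7u) << 13) | (vlan_id & 0x0FFFu));
    memcpy(dst, src, ETH_ADDR_BYTES);
    dst[12] = (uint8_t)(ETH_TYPE_8021Q >> 8);
    dst[13] = (uint8_t)(ETH_TYPE_8021Q & 0xFF);
    dst[14] = (uint8_t)(tci >> 8);
    dst[15] = (uint8_t)(tci & 0xFF);
    memcpy(dst + 16, src + ETH_ADDR_BYTES, len - ETH_ADDR_BYTES);
    return len + VLAN_TAG_SIZE;
}

/* Removes the tag in place if present and reports priority and VLAN ID
 * (both 0 when the frame is untagged). Returns the resulting length. */
static size_t strip_priority_tag(uint8_t *frame, size_t len,
                                 unsigned *priority, unsigned *vlan_id)
{
    uint16_t tci;

    *priority = 0;
    *vlan_id = 0;
    if (len < ETH_HEADER_SIZE + VLAN_TAG_SIZE ||
        frame[12] != (ETH_TYPE_8021Q >> 8) ||
        frame[13] != (ETH_TYPE_8021Q & 0xFF))
        return len;
    tci = (uint16_t)((frame[14] << 8) | frame[15]);
    *priority = (unsigned)(tci >> 13);
    *vlan_id = tci & 0x0FFFu;
    memmove(frame + ETH_ADDR_BYTES, frame + ETH_ADDR_BYTES + VLAN_TAG_SIZE,
            len - ETH_ADDR_BYTES - VLAN_TAG_SIZE);
    return len - VLAN_TAG_SIZE;
}
```

In the driver the priority and VLAN values would come from, and go to, the OOB block of the NDIS packet; here they are plain parameters to keep the sketch self-contained.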

Connect detection

Typically, a VirtIO network device driver always indicates that it is connected to the network. During manual WHQL tests, however, the test operator is required to remove the network cable from the socket. The VirtIO network device implementation in QEMU reports the connection status to the driver, which allows the driver to indicate connect/disconnect events up to NDIS. When disconnected, the driver suspends its send operation and fails packets submitted for sending.

Filtering of incoming packets

Filtering is a required feature of a network adapter under Windows. The NetKVM driver maintains an NDIS-managed filter mask that controls which packets the driver shall indicate and which it shall drop. This filter can allow one or more of:

  • Unicast packets (to the device's own MAC address)
  • Multicast packets (to one of multicast addresses configured from NDIS)
  • All multicast packets (to any multicast addresses)
  • Broadcast packets
  • All packets

For each incoming packet the driver analyzes the destination MAC address and makes a decision based on the current filter mask. The packet data is directly available for this check, because the packet resides in one of the buffers allocated by the driver during initialization.
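A minimal sketch of such a filter decision is shown below. The flag names, the structure and the function are hypothetical and chosen for illustration; the real driver works with the NDIS packet filter mask and its own adapter context.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical filter bits, one per filter option listed above. */
enum {
    FILTER_UNICAST       = 1 << 0,
    FILTER_MULTICAST     = 1 << 1,
    FILTER_ALL_MULTICAST = 1 << 2,
    FILTER_BROADCAST     = 1 << 3,
    FILTER_ALL_PACKETS   = 1 << 4
};

typedef struct {
    unsigned filter;                  /* current filter mask          */
    uint8_t own_mac[MAC_LEN];         /* MAC address of the device    */
    const uint8_t (*mcast)[MAC_LEN];  /* configured multicast list    */
    size_t mcast_count;
} rx_filter_t;

/* Returns true if a packet with destination address `dest`
 * shall be indicated, false if it shall be dropped. */
static bool rx_filter_accepts(const rx_filter_t *f, const uint8_t *dest)
{
    static const uint8_t bcast[MAC_LEN] =
        { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
    size_t i;

    if (f->filter & FILTER_ALL_PACKETS)
        return true;
    if (memcmp(dest, bcast, MAC_LEN) == 0)
        return (f->filter & FILTER_BROADCAST) != 0;
    if (dest[0] & 1) {                /* group bit set: multicast */
        if (f->filter & FILTER_ALL_MULTICAST)
            return true;
        if (!(f->filter & FILTER_MULTICAST))
            return false;
        for (i = 0; i < f->mcast_count; i++)
            if (memcmp(dest, f->mcast[i], MAC_LEN) == 0)
                return true;
        return false;
    }
    return (f->filter & FILTER_UNICAST) &&
           memcmp(dest, f->own_mac, MAC_LEN) == 0;
}
```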

Statistics

An NDIS miniport is required to support statistics for send and receive operations: the number of successfully/unsuccessfully sent/received bytes/packets per kind of packet (unicast, multicast, broadcast). The main consumer of this statistics information is the NDIS test.

Source files

Common files

Files under Common and their functionality:

  • ParaNdis-Common.c, ndis56common.h, ethernetutils.h – all the common mechanisms and flow: system-independent initialization and cleanup, receive and transmit processing, interactions with the VirtIO library
  • ParaNdis-Oid.c – implementation of the system-independent part of OID requests
  • ParaNdis-Debug.c, kdebugprint.h – implementation of debug printouts
  • sw-offload.c – all the procedures related to IP packet parsing and checksum calculation in software
  • osdep.h – common header for VirtIO
  • IONetDescriptor.h – VirtIO header definition; must be aligned with the one used by the device (QEMU)
  • quverp.h – header used for resource generation (visible in file properties)

NDIS5-specific files

Files under WXP and their functionality:

  • ParaNdis5-Driver.c – implementation of the NDIS5 driver part
  • ParaNdis5-Impl.c, ParaNdis5.h – NDIS5-specific implementation of procedures required by the common part
  • ParaNdis5-Oid.c – NDIS5-specific implementation of OID
  • netkvm.inf – INF file for XP/2003

NDIS6-specific files

Files under WLH and their functionality:

  • ParaNdis6-Driver.c – implementation of the NDIS6 driver part
  • ParaNdis6-Impl.c, ParaNdis6.h – NDIS6-specific implementation of procedures required by the common part
  • ParaNdis6-Oid.c – NDIS6-specific implementation of OID
  • netkvm.inf – INF file for Vista/2008

Flows and operation

Initialization

Configurable parameters

Upon initialization the driver retrieves configurable parameters from the adapter's registry settings. For each parameter it checks the validity of the retrieved value (for that purpose the driver holds a default, a minimal and a maximal value per parameter) and initializes the adapter according to this configuration. The goal of the validity check is to prevent unexpected behavior of the driver during NDIS tests, where the test procedure writes random and invalid values to the registry and then starts the adapter (the driver may fail initialization, but must not crash or stop responding). See “List of configurable parameters” below for the complete list.
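The validity check can be pictured with the following sketch. It is an assumption-laden illustration: the structure and function names are hypothetical, and falling back to the default for a missing or out-of-range value is one plausible policy, which the text above does not confirm (clamping to the nearest limit would be another).

```c
#include <stdint.h>

/* Hypothetical per-parameter limits: default, minimal and maximal value. */
typedef struct {
    uint32_t def;
    uint32_t min;
    uint32_t max;
} param_limits_t;

/* Returns the value to use for a parameter.
 * present: nonzero if the value was found in the registry.
 * raw:     the value read from the registry (possibly random garbage). */
static uint32_t param_validate(const param_limits_t *lim,
                               int present, uint32_t raw)
{
    if (!present || raw < lim->min || raw > lim->max)
        return lim->def;   /* assumption: fall back to the default */
    return raw;
}
```

With limits {1500, 500, 65500}, a registry value of 9000 is accepted, while 0 or 0xFFFFFFFF yields the default of 1500.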

Initialization of the driver and device instance

Driver initialization always starts from the system-dependent implementation of the DriverEntry procedure. In it the driver registers a set of callback procedures that implement per-adapter operations:

  • initialization
  • packets sending
  • returning of received packets
  • OID support
  • pausing and resuming (for NDIS6)
  • restart
  • halting

This initialization happens only once. For each PCI device created by QEMU, the registered initialization procedure is called to allocate and prepare a per-device context in the driver. There is no dependency between different devices supported by the same driver; they do not share any resources. Upon adapter initialization, the system-dependent procedure

  • Allocates storage for device context
  • Does minimal registration of the device instance (just to allow further access to configuration interface and hardware resources)
  • Retrieves configurable parameters
  • Uses allocated hardware resources (IO ports) to access the device and retrieve its features (host features) and MAC address, configured by QEMU command-line (can be overwritten by NDIS)
  • Initializes the device context according to retrieved configuration and device features
  • Allocates and initializes VirtIO library objects
  • Allocates and initializes pre-allocated descriptors, data blocks and binds necessary system-dependent objects to them (according to configurable parameters and device capabilities and features)
  • Enables necessary features in the device (guest features)
  • Enables the interrupt streaming from the device (using interface with device)

Some mechanisms required by driver during initialization have different implementation for NDIS5 and NDIS6:

  • Virtual memory allocation
  • Obtaining configuration handle to retrieve configurable parameters
  • Physical memory allocation and deallocation
  • Allocation and freeing of NDIS objects used to indicate received data
  • Configuration of the interrupt handler and of the DPC procedure
  • Registration of DMA capabilities (to be able to receive hardware addresses of transmitted buffers)

The initialization process combines system-independent and system-dependent code in order to minimize code duplication and to use system-dependent code only when required. The NDIS6 driver has a more advanced initialization process that includes reporting of initial settings (under NDIS5, NDIS needs to query many driver parameters via OID requests immediately after initialization).

Initialization and cleanup of VirtIO objects

Upon initialization, the driver creates two queue objects in the VirtIO library: one for RX, one for TX. The VirtIO library retrieves the available number of blocks from the device and allocates storage for the required number of blocks. These two objects are used for all functional operation after startup. TODO: the VirtIO library in its internal code uses non-NDIS calls for allocation of physical memory and for translation between physical and virtual addresses. In general these calls are illegal, but currently the WHQL procedure that verifies calls made from an NDIS driver does not report an error on them. Upon device shutdown (disable or remove), the driver deletes these objects using a VirtIO library call. A special case is the driver's support for power management events (see “Power management”).

Guest-Host negotiable parameters

Host features do not require any acknowledgement from the guest when they do not affect the driver-to-device interface. Support for mergeable buffers does affect it, so it is activated only when the driver enables the same feature in the guest feature set (using the PCI part of the VirtIO library interface).

Feature                                     Mask in HOST features      Mask in GUEST features
Checksum calculation by device              VIRTIO_NET_F_CSUM          Not required
LSO support by device                       VIRTIO_NET_F_GUEST_TSO4    Not required
Mergeable RX buffers                        VIRTIO_NET_F_MRG_RXBUF     VIRTIO_NET_F_MRG_RXBUF
Generation of connect / disconnect events   VIRTIO_NET_F_STATUS        Not required
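The negotiation in the table can be sketched as a pure function over bit masks. This is only an illustration: the feature bit numbers are those of the VirtIO network device specification, while FEATURE_BIT, negotiated_t and negotiate_features are hypothetical names, and the sketch assumes the driver enables mergeable buffers whenever the host offers them.

```c
#include <stdint.h>

/* Feature bit numbers from the VirtIO network device specification. */
#define VIRTIO_NET_F_CSUM        0
#define VIRTIO_NET_F_GUEST_TSO4  7
#define VIRTIO_NET_F_MRG_RXBUF   15
#define VIRTIO_NET_F_STATUS      16

#define FEATURE_BIT(n) (1u << (n))   /* hypothetical helper */

/* What the driver learned from the host features (hypothetical type). */
typedef struct {
    int host_checksum;   /* device can calculate checksums       */
    int host_lso;        /* device supports LSO                  */
    int mergeable_rx;    /* mergeable RX buffers are in use      */
    int link_status;     /* device reports connect/disconnect    */
} negotiated_t;

/* Inspects the host feature mask, fills *out and returns the guest
 * feature mask to write back. Only mergeable RX buffers need to be
 * acknowledged, because only they change the driver-to-device interface. */
static uint32_t negotiate_features(uint32_t host, negotiated_t *out)
{
    uint32_t guest = 0;

    out->host_checksum = (host & FEATURE_BIT(VIRTIO_NET_F_CSUM)) != 0;
    out->host_lso      = (host & FEATURE_BIT(VIRTIO_NET_F_GUEST_TSO4)) != 0;
    out->mergeable_rx  = (host & FEATURE_BIT(VIRTIO_NET_F_MRG_RXBUF)) != 0;
    out->link_status   = (host & FEATURE_BIT(VIRTIO_NET_F_STATUS)) != 0;

    if (out->mergeable_rx)
        guest |= FEATURE_BIT(VIRTIO_NET_F_MRG_RXBUF);
    return guest;
}
```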

Interrupt processing

The system-dependent interrupt handling procedure passes control to the common interrupt handling procedure; the interrupt status bit mask is read from the VirtIO device (the read operation clears the status in the device). Currently there are two status bits: bit 1 indicates connect detection and bit 0 indicates any other VirtIO event (data available for reading in the RX queue or freed buffers in the TX queue). The common interrupt handling procedure serves two main tasks, sending and receiving, by calling the system-dependent implementation of ParaNdis_ProcessTx and the common procedure ParaNdis_ProcessRxPath.
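The decoding of the two status bits might look like the sketch below. All names are hypothetical; the counters merely stand in for the real work (link state processing and the RX/TX paths), and treating a zero status as "not our interrupt" is an assumption about shared interrupt lines rather than something stated above.

```c
#include <stdint.h>

#define ISR_QUEUE_EVENT  0x01u   /* bit 0: RX data or freed TX buffers */
#define ISR_LINK_CHANGE  0x02u   /* bit 1: connect detection           */

/* Hypothetical bookkeeping standing in for the actual processing. */
typedef struct {
    unsigned queue_runs;    /* times the RX/TX paths would be run  */
    unsigned link_checks;   /* times the link state would be read  */
} isr_counters_t;

/* status: value read (and thereby cleared) from the device.
 * Returns nonzero if the interrupt belonged to this device. */
static int handle_isr_status(uint8_t status, isr_counters_t *c)
{
    if (status == 0)
        return 0;                    /* assumption: not our interrupt */
    if (status & ISR_LINK_CHANGE)
        c->link_checks++;            /* connect/disconnect event      */
    if (status & ISR_QUEUE_EVENT)
        c->queue_runs++;             /* process TX, then RX           */
    return 1;
}
```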

Sending data

Initialization of transmit path

For send operations, the driver preallocates the following resources:

  • Descriptors for TX blocks, suitable for a linked list. The initial number of TX descriptors is defined in the configuration and is cut down if the related VirtIO queue can accommodate fewer blocks than configured.
  • A buffer in physical memory for the VirtIO header, for each descriptor
  • A buffer in physical memory of (MTU + 4) bytes, for each descriptor. This buffer is able to accommodate a packet transmitted by copy, with possible population of the priority tag

Each TX transaction with VirtIO requires at least 2 physical blocks: a header of 10 bytes (12 when using mergeable buffers) and one or more data blocks. So, if the TX queue of VirtIO has a capacity of 256 blocks, the driver prepares 128 descriptors with two attached memory blocks each. The driver sets the limit of hardware blocks it has to 256. The data block attached to a descriptor may or may not be used for each specific transaction. The VirtIO header is always in use.
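The sizing arithmetic above can be written down as a short sketch. The names are hypothetical, and the sketch assumes that the hardware block limit always equals the queue capacity, which the text states only for the 256-block example.

```c
#include <stdint.h>

#define BLOCKS_PER_COPY_TX 2   /* VirtIO header + one data block */

/* Hypothetical result of TX resource planning. */
typedef struct {
    uint32_t descriptors;   /* TX descriptors to preallocate      */
    uint32_t hw_buffers;    /* limit of hardware blocks in VirtIO */
} tx_resources_t;

/* configured:     number of TX buffers requested in the configuration
 * queue_capacity: number of blocks the VirtIO TX queue can hold     */
static tx_resources_t tx_plan(uint32_t configured, uint32_t queue_capacity)
{
    tx_resources_t r;
    uint32_t max_desc = queue_capacity / BLOCKS_PER_COPY_TX;

    r.descriptors = configured < max_desc ? configured : max_desc;
    r.hw_buffers = queue_capacity;   /* assumption, see lead-in */
    return r;
}

/* Size of the VirtIO network header placed in front of each packet. */
static uint32_t virtio_net_header_size(int mergeable_buffers)
{
    return mergeable_buffers ? 12u : 10u;
}
```

With a queue capacity of 256 blocks, a configured value of 1024 is cut to 128 descriptors, while a configured value of 64 is kept as is.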

Configurable parameters, related to transmit path:

  • Number of buffers to use. Can be set to less than half of the VirtIO capacity. If set to more, it is cut down.
  • Use of scatter-gather
  • Checksum offload support
    • IP
    • TCP
    • UDP
  • LSO offload support; ignored if SG is not set

There are also OS-manageable parameters for LSO and checksum offload (at least the testing software verifies their functionality; they are used in NDIS6). Their names start with “*”, and they can be used to set the initial state of offload features upon driver startup.

Common principles

The driver receives packets for sending in its standard NDIS procedure, which depends on the NDIS version. Upon a call from NDIS for packet sending, this procedure is responsible for:

  • allocating and initializing a per-packet send entry, with a preliminary check of the ability to deal with the packet
  • ensuring the packet has its physical buffers ready (happens automatically in NDIS5 only)
  • queuing the packet (or send entry) into the internal list
  • calling the main body to process the list

Upon interrupt from VirtIO device the activated DPC procedure

  • calls main body to process the list

Main body procedure:

  • calls the common procedure to retrieve from the VirtIO device previously submitted blocks which may already be transmitted
  • if some buffers are released by the VirtIO device, prepares them for further completion
  • processes the list of packets to send
  • prepares transfer parameters for the current packet
    • how many physical buffers the packet contains
    • length of the data
    • offload requirements (checksum, LSO)
    • priority tagging
  • calls the common mechanism for packet submission while it is possible
    • the packet may be submitted
    • the packet may be failed
    • the packet may be delayed if we can send it, but not currently (no room)
  • successfully submitted packets may be moved to the waiting list for further completion
  • failed packets may be moved to the waiting list for further completion
  • breaks from the sending loop when there is no room for sending
  • processes the waiting list, completes finished packets, frees system resources related to the operation

There are two possible paths of data transmission: copying the packet data from virtual memory to preallocated buffers, and directly using the hardware addresses (SG table) provided per packet by the system.

For very short packets that require padding to the minimal size, the driver must use copy. It also uses copy when it is configured to run without SG support, or when it supports the software version of TCP or IP checksum offload and this offload is required for the specific packet. For all other packets it uses the SG table, including packets that require LSO; in this last case the driver receives packets with a wrong IP checksum and must set it properly before passing the packet to the device.

When the packet is transmitted using the SG table but a priority/VLAN tag is to be inserted or the IP checksum shall be fixed, the driver partially replaces the data in the outgoing packet with a modified copy of the data in the preallocated data buffer (the original packet is not touched, of course). Depending on which physical buffers are included in the packet, the driver may ignore one or more of the packet’s hardware buffers fully or partially. Instead of submitting the original buffers from the packet, the driver copies the required starting part of the data from the packet to its own buffer (attached to the VirtIO descriptor), makes all the required modifications in it and submits to VirtIO an array of buffers starting from its own buffer (see ParaNdis_PacketMapper and ParaNdis_PacketCopier).
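The per-packet choice between the two paths can be summarized by the following sketch. It is not the logic of ParaNdis_DoSubmitPacket itself: all names are hypothetical, and the 60-byte threshold is an assumption (the usual minimal Ethernet frame size without FCS), since the text above does not give the number.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_MIN_FRAME 60   /* assumed minimal frame size, without FCS */

typedef enum {
    TX_PATH_COPY,                 /* copy to the preallocated buffer       */
    TX_PATH_SG,                   /* submit the packet's own SG table      */
    TX_PATH_SG_WITH_PATCHED_HEAD  /* SG, head replaced by a modified copy  */
} tx_path_t;

/* Hypothetical per-packet transfer parameters. */
typedef struct {
    uint32_t length;       /* packet length in bytes                     */
    bool lso;              /* LSO requested (IP checksum must be fixed)  */
    bool sw_checksum;      /* software checksum offload required         */
    bool priority_tag;     /* priority/VLAN tag must be inserted         */
} tx_request_t;

static tx_path_t choose_tx_path(const tx_request_t *p, bool sg_enabled)
{
    /* Short packets need padding; no SG support or software checksum
     * means the data must be in the driver's own buffer anyway. */
    if (!sg_enabled || p->length < ETH_MIN_FRAME || p->sw_checksum)
        return TX_PATH_COPY;
    /* Headers must be modified without touching the original packet. */
    if (p->lso || p->priority_tag)
        return TX_PATH_SG_WITH_PATCHED_HEAD;
    return TX_PATH_SG;
}
```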

In order to decide whether the packet can be transmitted now or must wait for room, the driver needs to know:

  • whether there is one available descriptor (for one or more buffers to transmit)
  • how many hardware buffers are available in VirtIO for SG elements
  • how many hardware buffers the packet currently being transmitted requires

During packet mapping before transfer, the driver calculates the number of physical buffers required to transmit the packet. For a copy operation this number is two (the VirtIO device header and the preallocated buffer for the data payload).

For an SG operation, before it processes the packet, the driver makes a worst-case estimation of the required number of hardware buffers: the number of hardware fragments in the packet, plus one always (for the VirtIO header), plus one conditional (when the priority tag shall be populated, or when LSO is required and the IP header must be fixed).
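The estimation and the resulting decision can be expressed as a short sketch. The structure and function names are hypothetical; the arithmetic follows the description above.

```c
#include <stdbool.h>
#include <stdint.h>

#define COPY_PATH_BUFFERS 2   /* VirtIO header + preallocated data buffer */

/* Hypothetical snapshot of the free TX resources. */
typedef struct {
    uint32_t free_descriptors;   /* available TX descriptors          */
    uint32_t free_hw_buffers;    /* available hardware buffers in VirtIO */
} tx_room_t;

/* Worst-case number of hardware buffers for an SG transfer. */
static uint32_t sg_worst_case_buffers(uint32_t fragments,
                                      bool needs_tag, bool needs_lso)
{
    uint32_t n = fragments + 1;      /* + VirtIO header */

    if (needs_tag || needs_lso)
        n += 1;                      /* + driver's buffer with modified head */
    return n;
}

/* True if the packet can be submitted now, false if it must wait. */
static bool tx_can_submit(const tx_room_t *room, uint32_t needed_buffers)
{
    return room->free_descriptors >= 1 &&
           room->free_hw_buffers >= needed_buffers;
}
```

A packet with 3 hardware fragments needs at most 4 buffers, or 5 when a priority tag or LSO is involved; for the copy path the number is always COPY_PATH_BUFFERS.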

In the case of a copy operation the driver could report completion of the send operation immediately, but this is impossible when it uses SG with system-owned physical buffers. The driver decides how to transfer each specific packet on a per-packet basis, so it uses a common completion scheme: the packet is reported as completed when VirtIO returns the buffers submitted for this packet.

This also ensures that the order of packet completion to NDIS is the same as the order in which the packets were sent (NDIS5 is sensitive to it). The opaque data value provided to the VirtIO library upon the add_buf operation and returned upon the get_buf operation is a pointer to the per-packet send entry, which is allocated by the driver and contains all the information required to complete the packet and free the resources allocated for it. The structure of this entry differs between the NDIS5 and NDIS6 implementations.

The driver always keeps track of the number of available hardware buffers in VirtIO, decrementing it on each successful add_buf operation and incrementing it on each successful get_buf operation by the number of hardware buffers kept in the per-packet send entry.
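The bookkeeping around add_buf and get_buf can be modeled as follows. This is a simplified sketch with hypothetical names: send_entry_t stands in for the real per-packet send entry (tSendEntry or tNetBufferEntry), and the actual VirtIO calls are omitted; only the counter updates that would surround them are shown.

```c
#include <stdint.h>

/* Simplified stand-in for the per-packet send entry. */
typedef struct {
    uint32_t hw_buffers;   /* hardware buffers submitted for the packet */
    int completed;         /* set when VirtIO has returned the buffers  */
} send_entry_t;

/* Simplified TX state; in the driver it is protected by the Send Lock. */
typedef struct {
    uint32_t free_hw_buffers;
} tx_state_t;

/* To be called after a successful add_buf that used `used` buffers and
 * passed `e` as the opaque value. Returns 0 on success, -1 if the
 * accounting would underflow. */
static int tx_on_submitted(tx_state_t *tx, send_entry_t *e, uint32_t used)
{
    if (used > tx->free_hw_buffers)
        return -1;
    e->hw_buffers = used;
    e->completed = 0;
    tx->free_hw_buffers -= used;
    return 0;
}

/* To be called with the opaque value returned by a successful get_buf. */
static void tx_on_released(tx_state_t *tx, void *opaque)
{
    send_entry_t *e = opaque;

    tx->free_hw_buffers += e->hw_buffers;
    e->completed = 1;      /* ready for completion outside the lock */
}
```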

All the members of the adapter context structure related to the sending path are protected by the Send Lock. All functional calls to the upstream Send Queue of the VirtIO library (get_buf, add_buf, kick) must be protected by the same lock object. Note that TX packet completion must be executed without holding spinlocks in both NDIS5 and NDIS6: completing a packet while holding a spinlock may cause a deadlock, for example if NDIS immediately calls the send operation. See also Synchronization.

Related procedures (the numbers are used for cross-references in the descriptions):

In ParaNdis-Common.c:

  • (1) ParaNdis_DoSubmitPacket – Decide which mechanism to use for sending. Call 2 or 5 to prepare the data. Call add_buf to submit. Return the final status of the operation.
  • (2) ParaNdis_DoCopyPacketData – Obtain a descriptor and a buffer for data copy, call 6 to copy the data.
  • (3) ParaNdis_VirtIONetReleaseTransmitBuffers – Call get_buf to retrieve data. Return the available descriptor to the pool. Track the number of available descriptors and buffers. Call 7 to do packet-dependent operations.

Implementation (NDIS version dependent), in ParaNdisX-Impl.c:

  • (4) ParaNdis_ProcessTx – Main body. Process the packet list, call 1 to submit, finish processing depending on the result.
  • (5) ParaNdis_PacketMapper – Process the packet and create the final list of SG elements. If required, replace it partially using the mechanism in 6. Make all the preparations for LSO.
  • (6) ParaNdis_PacketCopier – Process the packet and copy data to the provided buffer. If needed, populate the priority tag.
  • (7) ParaNdis_OnTransmitBufferReleased – Label the packet as ready for further completion.

An NDIS miniport is required to support statistics for send operations: the number of successfully/unsuccessfully sent bytes/packets per kind of packet (unicast, multicast and broadcast). This means that in any case the driver needs to preview the packet’s header. The same ParaNdis_PacketCopier procedure is used to copy only the Ethernet header out of the packet for this preview.

Sending data in NDIS5 scheme

The driver receives an array of NDIS_PACKET structures, each of which describes a single packet to be sent. Each NDIS_PACKET contains a chain of NDIS_BUFFER structures describing the packet’s fragments in virtual memory. Each of these fragments may consist of more than one part in physical memory.

The driver traverses the chain of NDIS_BUFFER structures in order to access their data for copying in ParaNdis_PacketCopier.

The driver obtains the following per-packet information from the packet’s OOB block:

  • SG list of fragments in physical memory (in ParaNdis_PacketMapper)
  • Priority tagging requirements
  • Offload requirements for LSO (TcpLargeSendPacketInfo)
  • Offload requirements for checksum (TcpIpChecksumPacketInfo)

On entry to the “Send” procedure the driver allocates a Send Entry (tSendEntry structure) for each packet, and in further processing it maintains the Send Queue and its WL as lists of Send Entries.

When packets are processed by ParaNdis_DoSubmitPacket, the packets that are not delayed are moved to the WL. Failed packets are labeled as if their transmit buffer was released. The WL is processed on each exit from the main TX procedure: finished packets are completed and the attached resources (send entry) are freed.

Sending data in NDIS6 scheme

The driver receives a list of NET_BUFFER_LIST structures; each NET_BUFFER_LIST contains a chain of NET_BUFFER structures, where each NET_BUFFER represents a packet. Each NET_BUFFER contains a list of MDL structures, where each MDL is a list of fragments in virtual memory.

The driver shall report completion using a list of the same NET_BUFFER_LIST structures, although it can group them as it wants, keeping the data and the set of NET_BUFFER structures untouched. The status is reported per NET_BUFFER_LIST, and the OOB data is also bound to it. When sending packets, the driver needs to track the completion of packets and complete the NET_BUFFER_LIST when the last packet from it is finished.

For this reason the driver allocates one structure (tNetBufferListEntry) per NET_BUFFER_LIST and one structure (tNetBufferEntry, starting from a list entry) per NET_BUFFER. The Send Queue contains NET_BUFFER_LIST elements, where each tNetBufferListEntry includes the list of its tNetBufferEntry elements.

Unlike NDIS5, in NDIS6 the packet initially does not disclose its hardware addresses. In order to know them, the driver must initiate a mapping operation per packet, and it receives the SG list of the packet only in a callback procedure. The mapping request (NdisMAllocateNetBufferSGList) must be issued at DPC level, and the callback may be called synchronously or asynchronously. In the latter case, the order in which the callback is called for different packets is not guaranteed; at least it is not documented. Thus the driver can hardly start processing a NET_BUFFER_LIST until all the packets from it are mapped.

The ParaNdis6_Send procedure receives the list of NET_BUFFER_LIST structures and

  • for each NET_BUFFER_LIST, allocates a tNetBufferListEntry and initializes all the list-wide parameters in it
  • writes the tNetBufferListEntry into the Scratch field of the NET_BUFFER_LIST
  • for each NET_BUFFER in the NET_BUFFER_LIST, allocates a tNetBufferEntry
  • the tNetBufferListEntry contains a linked list of its tNetBufferEntry structures
  • each tNetBufferEntry keeps a pointer to its NET_BUFFER_LIST
  • for each tNetBufferEntry, requests mapping
  • if mapping fails for one or more tNetBufferEntry structures, calls the callback directly with an empty set of addresses

Upon the callback with hardware addresses available, the driver shall

  • keep the addresses in the tNetBufferEntry (they need to be freed later)
  • increment the number of mapped buffers belonging to the parent tNetBufferListEntry
  • when all the buffers are mapped
    • insert the parent NET_BUFFER_LIST into the Send Queue
    • start the main body of the TX operation (ParaNdis_ProcessTx)

Inside the main body of the TX operation the driver peeks at the packet list at the head of the Send Queue, retrieves the next packet to send from it and tries to submit it.

If the packet is successfully submitted (as described in the common part), a non-empty tNetBufferListEntry is processed again, and an empty tNetBufferListEntry is moved to the WL. If the packet is failed, the packet completion procedure is called immediately.

When the packet is completed by VirtIO (or failed), the packet completion procedure labels it as finished, increments the number of finished packets in the tNetBufferListEntry and frees the resources associated with the specific tNetBufferEntry.

The main body of TX exits its loop when all the packets are submitted or when the next packet is delayed (no VirtIO buffers for it). Then the procedure checks its WL and completes each NET_BUFFER_LIST in it that has all its packets completed.
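The two counters that drive this flow (mapped buffers and finished packets per NET_BUFFER_LIST) can be sketched in isolation. The type and function names below are hypothetical simplifications; the real tNetBufferListEntry holds more state, and the real callbacks also manage SG lists and queues.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for tNetBufferListEntry. */
typedef struct {
    unsigned total_buffers;     /* NET_BUFFERs in the NET_BUFFER_LIST */
    unsigned mapped_buffers;    /* mapping callbacks received so far  */
    unsigned finished_buffers;  /* packets completed or failed so far */
} nbl_entry_t;

/* A mapping callback arrived for one buffer (in any order).
 * Returns true when the list may be inserted into the Send Queue. */
static bool nbl_on_buffer_mapped(nbl_entry_t *e)
{
    e->mapped_buffers++;
    return e->mapped_buffers == e->total_buffers;
}

/* One packet was completed by VirtIO or failed.
 * Returns true when the NET_BUFFER_LIST may be completed to NDIS. */
static bool nbl_on_buffer_finished(nbl_entry_t *e)
{
    e->finished_buffers++;
    return e->finished_buffers == e->total_buffers;
}
```

Counting instead of tracking order is what makes the scheme independent of the order in which asynchronous mapping callbacks arrive.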

Canceling packets in process

NDIS may request canceling of packets in process by providing a Cancel ID, which is saved in the OOB block.

For NDIS5, this is the ID of a packet, so it is simple to find the packet in the Send Queue and complete it.

For NDIS6, this is the ID of a NET_BUFFER_LIST, which may have some of its packets in process. So, if the NET_BUFFER_LIST is not started yet, it is completed; if it is in process, it is not cancelled.

Checksum offload

The NetKVM driver includes an implementation of all the TX CS offloads for IPV4 (IP + options, TCP + options, UDP), and it works functionally. CS offload can work with SW emulation and can use host support (when the host declares in the host features bit mask that it is able to do it, by setting VIRTIO_NET_F_CSUM).

It is very helpful during development and diagnostics; some of these mechanisms are also used for LSO.

There are 3 problems:

  • IP and TCP, or IP and UDP, cannot be enabled together when hardware CS is used. The VirtIO header is TCP checksum oriented. Possibly it can do IP CS, but not together with TCP.
  • The implementation does not pass corner cases of the NDIS test: the test sends packets with an invalid IP checksum and an invalid TCP checksum, the driver is required via OOB NOT to fix the CS, and the test expects to receive identical packets on the other side, but the host does not deliver them.
  • There is no performance improvement due to CS support.

The checksum offload is controlled by configuration and disabled by default. Implementation of SW emulation is in sw-offload.c.

All the procedures work with a data block starting from the beginning of the IP header. First of all they parse the header (at least the IP header; it is good to have the basic TCP header as well) to understand the kind of packet, and save the result of parsing in the tTcpIpPacketParsingResult structure (actually a 32-bit bit mask).

Then, when possible and requested, the procedures verify and/or calculate the required type of checksum. The exact required modification or verification of the checksum is set in the parameter bit mask passed to the procedure.

See following procedures:

  • ParaNdis_CheckSumCalculate (parses and calculates)
  • ParaNdis_CheckSumVerify (parses, calculates and modifies or leaves as is)
  • ParaNdis_ReviewIPPacket (parses only)
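For orientation, the core calculation behind these procedures is the standard Internet checksum (one's complement sum of 16-bit words, RFC 1071). The sketch below shows it for the IPv4 header only; it is not the code of sw-offload.c, the function names are hypothetical, and it omits the packet parsing and the TCP/UDP pseudo header.

```c
#include <stddef.h>
#include <stdint.h>

/* One's complement checksum over `len` bytes in network byte order. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;   /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);    /* fold carries back in  */
    return (uint16_t)~sum;
}

/* Header length in bytes, taken from the IHL field (includes options). */
static size_t ipv4_header_len(const uint8_t *ip)
{
    return (size_t)(ip[0] & 0x0F) * 4;
}

/* Calculates the header checksum and stores it at bytes 10..11. */
static void ipv4_set_header_checksum(uint8_t *ip)
{
    uint16_t cs;

    ip[10] = 0;
    ip[11] = 0;
    cs = inet_checksum(ip, ipv4_header_len(ip));
    ip[10] = (uint8_t)(cs >> 8);
    ip[11] = (uint8_t)(cs & 0xFF);
}

/* Returns nonzero if the header checksum is valid. */
static int ipv4_header_checksum_ok(const uint8_t *ip)
{
    return inet_checksum(ip, ipv4_header_len(ip)) == 0;
}
```

Because the header length is taken from the IHL field, IP options are covered automatically. The same summation applies to TCP and UDP, where NDIS has already placed the pseudo header checksum into the checksum field.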

The driver always receives the offset of the IP header in the OID parameters when NDIS configures the miniport for the offload task. Currently we declare support for only one encapsulation (802.3), and the offset is always 14 bytes.

Note that the packet data must be in a contiguous buffer in virtual space in order to be processed. As the traversal of all the data blocks in the packet is already implemented in the ParaNdis_PacketCopier procedures, they are also used by any procedure that needs the starting part of the packet for preview.