Internals of NDIS driver for VirtIO based network adapter
This document contains implementation notes on the Windows network adapter driver for the VirtIO network device.
Abbreviations in the document
- LSO, a.k.a. GSO, a.k.a. TSO – large send offload, generic segmentation offload, TCP segmentation offload – the same thing in the context of the Windows driver.
- OOB – out-of-band packet data; a set of structures associated with a packet
- SG – scatter-gather
- MTU – maximum transmission unit
- MSS – maximum segment size
- CS – checksum
- MDL – memory descriptor list
- WL – waiting list
- DUT – device under test
NDIS driver features
Basic networking operations
The NetKVM driver supports non-serialized sending and receiving of data. The driver handles multiple concurrent send operations, and NDIS is not required to serialize them (Logo requirement). Scatter-gather mode is the default for send operations; it can be disabled in the configuration to force copy mode into preallocated buffers. On receive, the driver can indicate up to NDIS as many packets as the VirtIO device can receive, without waiting for NDIS to free receive buffers.
The default MTU reported by the driver to NDIS is 1514 bytes, i.e. without LSO, NDIS sends packets of up to MTU bytes, and the driver cannot indicate reception of packets bigger than the MTU. The MTU can be changed through the device properties in Device Manager.
For the bidirectional data channel to and from the device, the driver initializes two VirtIO queues (transmit and receive). VirtIO queues contain only SG entries (physical address + size); the data buffers that those physical addresses point to are not managed by VirtIO. The buffers required for VirtIO headers and network data (when necessary) and the OS-specific objects required for data indication are allocated separately by the driver during initialization.
Checksum offload
An NDIS driver can declare the following types of TX checksum offload:
- IP checksum. NDIS does not initialize the IP header checksum; the device/driver fills it.
- TCP checksum. NDIS initializes the TCP header checksum with the “pseudo header checksum” value (the pseudo header does not exist in the packet; the checksum is calculated on a virtual structure), and the device/driver finishes it.
- Support for IP options and TCP options when calculating the checksum.
- UDP checksum – exactly as TCP without options.
- IP and TCP checksum for IPV6.
An NDIS driver may also implement RX checksum offload (the ability to verify IP/TCP checksums in incoming packets); this feature is not supported in NetKVM. The NetKVM driver implements all the CS offloads for IPV4, and they work functionally. This is very helpful during development and diagnostics; some of these mechanisms are also used for LSO. However, the implementation does not pass corner cases of the NDIS tests and is disabled in the configuration by default.
Large segment offload
In LSO operation NDIS provides a packet that is usually (but not always) bigger than the MTU; the driver/device shall divide it into several smaller fragments before sending them to the other side. Each fragment except the last one must contain MSS bytes of TCP data. During fragmentation, the device shall properly populate the IP and TCP headers, including IP options and most TCP options, and fill in all the necessary checksums. The NetKVM driver implements LSO for IPV4 with a maximal packet size of almost 64K before offloading. Note that, at least in the current implementation, LSO requires scatter-gather, as packets bigger than the MTU cannot be placed into preallocated buffers.
Priority and VLAN tagging
On its lower edge the NetKVM driver supports both the regular Ethernet header of 14 bytes and the Ethernet type II header of 18 bytes. The second one includes a Priority/VLAN tag (32 bits) with a 16-bit field holding the Priority and VLAN values. The Ethernet type II header cannot be indicated up to NDIS, and NDIS does not provide this tag in outgoing packets either. Instead, Priority and VLAN data reside in the OOB block of the NDIS packet(s), and populating the tag in outgoing packets is the responsibility of the driver. The driver is also responsible for removing this tag from incoming packets (if present) and placing the Priority data into the OOB block. The driver also checks the VLAN value and ignores incoming packets addressed to a VLAN different from the one configured for the specific instance of the driver.
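The tag insertion described above can be illustrated with a minimal sketch. It assumes the standard IEEE 802.1Q layout (a 16-bit TPID of 0x8100 followed by a 16-bit TCI with 3 bits of priority and 12 bits of VLAN ID); the function and constant names are hypothetical and are not taken from the NetKVM sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HEADER_SIZE    14  /* destination MAC + source MAC + EtherType */
#define PRIORITY_TAG_SIZE  4   /* TPID (2 bytes) + TCI (2 bytes) */

/* Hypothetical helper: builds an 18-byte tagged header in dst from a
 * 14-byte untagged header. Returns the size of the tagged header. */
static size_t build_tagged_header(uint8_t *dst, const uint8_t *untagged,
                                  unsigned priority, unsigned vlan_id)
{
    /* TCI: priority in the upper 3 bits, VLAN ID in the lower 12 bits */
    uint16_t tci = (uint16_t)(((priority & 0x7u) << 13) | (vlan_id & 0x0FFFu));

    memcpy(dst, untagged, 12);        /* both MAC addresses stay in place */
    dst[12] = 0x81;                   /* TPID 0x8100, network byte order  */
    dst[13] = 0x00;
    dst[14] = (uint8_t)(tci >> 8);    /* TCI, network byte order          */
    dst[15] = (uint8_t)(tci & 0xFF);
    dst[16] = untagged[12];           /* original EtherType follows tag   */
    dst[17] = untagged[13];
    return ETH_HEADER_SIZE + PRIORITY_TAG_SIZE;
}
```

Removing the tag from an incoming packet is the reverse operation: the 4 bytes at offset 12 are dropped and the priority bits are reported through the OOB block.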
Connect detection
Typically, a VirtIO network device driver always indicates that it is connected to the network. However, during manual WHQL tests the test operator is required to remove the network cable from the socket. The VirtIO network device implementation in QEMU reports the connection status to the driver, which allows the driver to indicate connect/disconnect events up to NDIS. When disconnected, the driver suspends its send operation and fails packets submitted for sending.
Filtering of incoming packets
Filtering is a required feature of a network adapter under Windows. The NetKVM driver maintains an NDIS-managed filter mask that controls which packets the driver indicates and which it drops. This filter can allow one or more of:
- Unicast packets (to device own MAC address)
- Multicast packets (to one of multicast addresses configured from NDIS)
- All multicast packets (to any multicast addresses)
- Broadcast packets
- All packets
For each incoming packet the driver analyzes the destination MAC address and makes a decision based on the current filter mask. The packet’s buffer is readily available: it is one of the buffers allocated by the driver during initialization.
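The filtering decision can be sketched as follows. This is only an illustration of the logic described above; the flag names, the `rx_filter_t` structure and the function are hypothetical (the real driver uses the NDIS packet filter bits and its own context structure).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical filter bits, one per filter class listed above */
enum {
    FILTER_UNICAST       = 1u << 0,
    FILTER_MULTICAST     = 1u << 1,
    FILTER_ALL_MULTICAST = 1u << 2,
    FILTER_BROADCAST     = 1u << 3,
    FILTER_ALL_PACKETS   = 1u << 4
};

typedef struct {
    uint32_t filter;                 /* current filter mask             */
    uint8_t  own_mac[6];             /* MAC address of the device       */
    const uint8_t (*mcast_list)[6];  /* multicast addresses set by NDIS */
    size_t   mcast_count;
} rx_filter_t;

/* Returns true if a packet with destination MAC dst shall be indicated */
static bool packet_passes_filter(const rx_filter_t *f, const uint8_t *dst)
{
    static const uint8_t bcast[6] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
    size_t i;

    if (f->filter & FILTER_ALL_PACKETS)
        return true;
    if (memcmp(dst, bcast, 6) == 0)
        return (f->filter & FILTER_BROADCAST) != 0;
    if (dst[0] & 0x01) {             /* group bit set: multicast address */
        if (f->filter & FILTER_ALL_MULTICAST)
            return true;
        if (f->filter & FILTER_MULTICAST) {
            for (i = 0; i < f->mcast_count; i++)
                if (memcmp(dst, f->mcast_list[i], 6) == 0)
                    return true;
        }
        return false;
    }
    /* unicast: only packets addressed to the device's own MAC */
    return (f->filter & FILTER_UNICAST) != 0 &&
           memcmp(dst, f->own_mac, 6) == 0;
}
```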
Statistics
An NDIS miniport is required to support statistics for send and receive operations: the number of successfully/unsuccessfully sent/received bytes/packets per kind of packet (unicast, multicast, broadcast). The main consumer of this statistical information is the NDIS test.
Source files
Common files
Files under Common | Functionality |
ParaNdis-Common.c, ndis56common.h, ethernetutils.h | All the common mechanisms and flow: system-independent initialization and cleanup, receive and transmit processing, interactions with VirtIO library |
ParaNdis-Oid.c | Implementation of system-independent part of OID requests |
ParaNdis-Debug.c, kdebugprint.h | Debug printouts implementation |
sw-offload.c | All the procedures related to IP packets parsing and checksum calculation using software |
osdep.h | Common header for VirtIO |
IONetDescriptor.h | VirtIO header definition – must be aligned with one used by device (QEMU) |
quverp.h | Header used for resource generation (visible on file properties) |
NDIS5-specific files
Files under WXP | Functionality |
ParaNdis5-Driver.c | Implementation of NDIS5 driver part |
ParaNdis5-Impl.c, ParaNdis5.h | NDIS5-specific implementation of procedures required by the common part |
ParaNdis5-Oid.c | NDIS5-specific implementation of OID |
netkvm.inf | INF file for XP/2003 |
NDIS6-specific files
Files under WLH | Functionality |
ParaNdis6-Driver.c | Implementation of NDIS6 driver part |
ParaNdis6-Impl.c, ParaNdis6.h | NDIS6-specific implementation of procedures required by the common part |
ParaNdis6-Oid.c | NDIS6-specific implementation of OID |
netkvm.inf | INF file for Vista/2008 |
Flows and operation
Initialization
Configurable parameters
Upon initialization the driver retrieves configurable parameters from the adapter’s registry settings. For each parameter it checks the validity of the retrieved value (for that purpose the driver holds a default, a minimal and a maximal value for each parameter) and initializes the adapter according to this configuration. The goal of the validity check is to prevent unexpected driver behavior during NDIS tests, where the test procedure writes random and invalid values to the registry and then starts the adapter (the driver may fail initialization, but must not crash or stop responding). See “List of configurable parameters” below for the complete list.
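A minimal sketch of such a validity check is shown below. The structure, the function and the fallback policy (use the default when the value is missing or out of range) are assumptions made for illustration; they are not copied from the driver.

```c
/* Hypothetical description of one configurable parameter */
typedef struct {
    const char   *name;  /* registry value name   */
    unsigned long def;   /* default value         */
    unsigned long min;   /* minimal allowed value */
    unsigned long max;   /* maximal allowed value */
} config_param_t;

/* Returns the value to use: the registry value if it was found and is
 * within [min, max], otherwise the default (assumed policy). */
static unsigned long validate_param(const config_param_t *p,
                                    int found, unsigned long raw)
{
    if (!found || raw < p->min || raw > p->max)
        return p->def;
    return raw;
}
```

With this policy a random value written by a test never reaches the rest of the initialization code; the adapter simply starts with the default.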
Initialization of the driver and device instance
Driver initialization always starts from the system-dependent implementation of the DriverEntry procedure. In it the driver registers a set of callback procedures that implement per-adapter operations:
- initialization
- packets sending
- returning of received packets
- OID support
- pausing and resuming (for NDIS6)
- restart
- halting
This initialization happens only once. For each PCI device created by QEMU, the registered initialization procedure is called to allocate and prepare a per-device context in the driver. Devices supported by the same driver do not depend on each other and do not share any resources. Upon adapter initialization, the system-dependent procedure:
- Allocates storage for device context
- Does minimal registration of the device instance (just to allow further access to configuration interface and hardware resources)
- Retrieves configurable parameters
- Uses allocated hardware resources (IO ports) to access the device and retrieve its features (host features) and the MAC address configured on the QEMU command line (can be overridden by NDIS)
- Initializes the device context according to retrieved configuration and device features
- Allocates and initializes VirtIO library objects
- Allocates and initializes pre-allocated descriptors, data blocks and binds necessary system-dependent objects to them (according to configurable parameters and device capabilities and features)
- Enables necessary features in the device (guest features)
- Enables the interrupt streaming from the device (using interface with device)
Some mechanisms required by driver during initialization have different implementation for NDIS5 and NDIS6:
- Virtual memory allocation
- Obtaining configuration handle to retrieve configurable parameters
- Physical memory allocation and deallocation
- Allocation and freeing of NDIS object to indicate received data
- Configuring of interrupt handler and handler of DPC procedure
- Registering of DMA capabilities (to be able to receive hardware addresses of transmitted buffers)
The initialization process combines system-independent and system-dependent code in order to minimize code duplication and to use system-dependent code only when required. The NDIS6 driver has a more advanced initialization process, which includes reporting the initial settings (under NDIS5, NDIS needs to query many driver parameters via OID requests immediately after initialization).
Initialization and clean up of VIRTIO object
Upon initialization, the driver creates two queue objects in the VirtIO library: one for RX and one for TX. The VirtIO library retrieves the available number of blocks from the device and allocates storage for the required number of blocks. These two objects are used for all functional operation after startup. TODO: the VirtIO library in its internal code uses non-NDIS calls for allocation of physical memory and translation between physical and virtual addresses. In general, these calls are illegal, but currently the WHQL procedure that verifies calls made from an NDIS driver does not report an error for them. Upon device shutdown (disable or remove), the driver deletes these objects using a VirtIO library call. A special case is driver support for power management events (see “Power management”).
Guest-Host negotiable parameters
Host features do not require any acknowledgement from the guest when they do not affect the driver-to-device interface. Support for mergeable buffers does affect it, so it is activated only when the driver enables the same feature in the guest features set (using the PCI part of the VirtIO library interface).
Feature | Mask in HOST features | Mask in GUEST features |
Checksum calculation by device | VIRTIO_NET_F_CSUM | Not required |
LSO support by device | VIRTIO_NET_F_HOST_TSO4 | Not required |
Mergeable RX buffers | VIRTIO_NET_F_MRG_RXBUF | VIRTIO_NET_F_MRG_RXBUF |
Generation of connect / disconnect events | VIRTIO_NET_F_STATUS | Not required |
Interrupt processing
The system-dependent interrupt handling procedure passes control to the common interrupt handling procedure, which reads the interrupt status bit mask from the VirtIO device (the read operation clears the status in the device). Currently there are two status bits: bit 1 indicates connect detection and bit 0 indicates any other VirtIO event (data available to read from the RX queue or buffers freed in the TX queue). The common interrupt handling procedure serves two main tasks, sending and receiving, by calling the system-dependent implementation of ParaNdis_ProcessTx and the common procedure ParaNdis_ProcessRxPath.
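The decoding of the status bit mask can be sketched as follows. The bit positions follow the description above; the macro, type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ISR_QUEUE_EVENT    0x01u  /* bit 0: RX data available or TX buffers freed */
#define ISR_CONNECT_EVENT  0x02u  /* bit 1: connect detection                     */

/* Hypothetical summary of what the common handler has to do */
typedef struct {
    bool process_queues;  /* run the TX and RX processing paths        */
    bool check_link;      /* re-read link state, indicate (dis)connect */
} isr_actions_t;

/* status is the value returned by the (status-clearing) read */
static isr_actions_t decode_isr(uint8_t status)
{
    isr_actions_t a;
    a.process_queues = (status & ISR_QUEUE_EVENT) != 0;
    a.check_link     = (status & ISR_CONNECT_EVENT) != 0;
    return a;
}
```

Because the read clears the status in the device, the value must be read once and kept; a second read would lose pending events.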
Sending data
Initialization of transmit path
For send operation, the driver pre-allocates following resources:
- Descriptors for TX blocks, suitable for a linked list. The initial number of TX descriptors is defined in the configuration and is cut down if the related VirtIO queue can accommodate fewer blocks than configured.
- Buffer in physical memory for the VirtIO header for each descriptor
- Buffer in physical memory of (MTU + 4) bytes for each descriptor. This buffer is able to accommodate a packet transmitted by copy, with possible population of the priority tag
Each TX transaction with VirtIO requires at least 2 physical blocks: a header of 10 bytes (12 when using mergeable buffers) and one or more data blocks. So, if the TX queue of VirtIO has a capacity of 256 blocks, the driver prepares 128 descriptors with two attached memory blocks each. The driver sets the limit of hardware blocks it has to 256. The data block attached to a descriptor may or may not be used for each specific transaction. The VirtIO header is always in use.
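The arithmetic above can be written down as a small sketch. The function names are hypothetical; the numbers (two blocks per descriptor, 10- or 12-byte header) come from the text.

```c
#include <stdbool.h>

/* Size of the VirtIO header placed in front of every TX transaction */
static unsigned virtio_header_size(bool mergeable_buffers)
{
    return mergeable_buffers ? 12u : 10u;
}

/* Number of TX descriptors to prepare: each descriptor owns two blocks
 * (header + data), so at most half of the queue capacity can be used. */
static unsigned tx_descriptor_count(unsigned configured,
                                    unsigned queue_capacity)
{
    unsigned limit = queue_capacity / 2;
    return configured < limit ? configured : limit;
}
```

For a queue of 256 blocks and a configured value of 1024, this gives 128 descriptors, matching the example above.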
Configurable parameters, related to transmit path:
- Number of buffers to use. Can be set to less than half of the VirtIO capacity; if set to more, it is cut down.
- Use of scatter-gather
- Checksum offload support
  - IP
  - TCP
  - UDP
- LSO offload support; ignored if SG is not set
There are also OS-manageable parameters for LSO and checksum offload (at least the testing software verifies their functionality; they are used in NDIS6). Their names start with “*” and they can be used to set the initial state of the offload features upon driver startup.
Common principles
The driver receives packets for sending in its standard NDIS send procedure, which depends on the NDIS version. Upon a call from NDIS to send packets, this procedure is responsible for the following:
- allocate and initialize per-packet send entry with preliminary check of ability to deal with packet
- ensure the packet has its physical buffers ready (automatically happens in NDIS5 only)
- queue packet (or send entry) into internal list
- call main body to process the list
Upon interrupt from VirtIO device the activated DPC procedure
- calls main body to process the list
Main body procedure:
- calls the common procedure to retrieve from the VirtIO device previously submitted blocks that may already be transmitted
  - if some buffers are released by the VirtIO device, they are prepared for further completion
- processes the list of packets to send
  - prepares transfer parameters for the current packet
    - how many physical buffers the packet contains
    - length of the data
    - offload requirements (checksum, LSO)
    - priority tagging
  - calls the common mechanism for packet submission while it is possible
    - packet may be submitted
    - packet may be failed
    - packet may be delayed if it can be sent, but not right now (no room)
  - successfully submitted packets may be moved to the waiting list for further completion
  - failed packets may be moved to the waiting list for further completion
  - breaks from the sending loop when there is no room for sending
- processes the waiting list, completes finished packets, frees system resources related to the operation
There are two possible paths of data transmission: copying packet data from virtual memory into preallocated buffers, and using hardware addresses (the SG table) provided by the system per packet. For very short packets that require padding to the minimal size, the driver must use copy. It also uses copy when configured to run without SG support, or when it supports the software version of TCP or IP checksum offload and this is required for the specific packet. For all other packets it uses the SG table, including those that require LSO; in this last case the driver receives packets with a wrong IP checksum and must set it properly before passing the packet to the device.
When the packet is transmitted using the SG table but a priority/VLAN tag must be inserted or the IP checksum must be fixed, the driver partially replaces the data in the outgoing packet with a modified copy held in the preallocated data buffer (the original packet is, of course, not touched). Depending on which physical buffers make up the packet, the driver may skip one or more of the packet’s hardware buffers fully or partially. Instead of submitting the original buffers, the driver copies the required starting part of the packet into its own buffer (attached to the VirtIO descriptor), makes all the required modifications there, and submits to VirtIO an array of buffers that starts with its own buffer (see ParaNdis_PacketMapper and ParaNdis_PacketCopier).
In order to decide whether the packet can be transmitted now or must wait for room, the driver needs to know:
- whether there is an available descriptor (for one or more buffers to transmit)
- how many hardware buffers are available in VirtIO for SG elements
- how many hardware buffers the packet being transmitted requires
While mapping the packet before transfer, the driver calculates the number of physical buffers required to transmit it. For copy operation this number is two (the VirtIO device header and the preallocated buffer for the data payload).
For SG operation, before it processes the packet, the driver makes a worst-case estimate of the required number of hardware buffers: the number of hardware fragments in the packet, plus one unconditionally (for the VirtIO header), plus one conditionally (when a priority tag must be populated, or LSO is required and the IP header must be fixed).
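The estimate and the resulting decision can be sketched as follows. The function names and parameters are hypothetical; the formula is the one given in the text.

```c
#include <stdbool.h>

/* Worst-case number of hardware buffers for a packet sent via SG */
static unsigned sg_buffers_needed(unsigned hw_fragments,
                                  bool needs_priority_tag, bool needs_lso)
{
    unsigned n = hw_fragments + 1;       /* +1 for the VirtIO header */
    if (needs_priority_tag || needs_lso)
        n += 1;                          /* +1 for the driver's own buffer
                                            holding the modified headers */
    return n;
}

/* Number of hardware buffers for a packet sent by copy */
static unsigned copy_buffers_needed(void)
{
    return 2;                            /* VirtIO header + data buffer */
}

/* The packet can be submitted now only if a descriptor is free and the
 * queue has room for all of its buffers; otherwise it is delayed. */
static bool can_submit_now(unsigned free_descriptors,
                           unsigned free_hw_buffers, unsigned needed)
{
    return free_descriptors >= 1 && free_hw_buffers >= needed;
}
```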
With copy operation the driver could report completion of the send operation immediately, but this is impossible when it uses SG with system-owned physical buffers. The driver decides how to transfer each packet on a per-packet basis, so it uses a common completion scheme: the packet is reported as completed when VirtIO returns the buffers submitted for this packet.
This also ensures that the order in which packets are completed to NDIS is the same as the order in which they were sent (NDIS5 is sensitive to it). The opaque data value provided to the VirtIO library in the add_buf operation and returned by the get_buf operation is a pointer to the per-packet send entry, which is allocated by the driver and contains all the information required to complete the packet and free the resources allocated for it. The structure of this entry differs between the NDIS5 and NDIS6 implementations.
The driver always keeps track of the number of available hardware buffers in VirtIO, decrementing it on each successful add_buf operation and incrementing it on each successful get_buf operation by the number of hardware buffers kept in the per-packet send entry.
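The bookkeeping described above can be modeled with a few lines of code. The structures and function names are hypothetical stand-ins for the adapter context and the per-packet send entry; the calls to the VirtIO library itself are omitted.

```c
/* Hypothetical per-packet send entry: the opaque value passed to
 * add_buf and returned by get_buf points to such a structure. */
typedef struct {
    unsigned hw_buffers;       /* buffers submitted for this packet */
} send_entry_t;

/* Hypothetical part of the adapter context (protected by Send Lock) */
typedef struct {
    unsigned free_hw_buffers;  /* room currently left in the TX queue */
} tx_state_t;

/* Called after add_buf succeeded for the packet */
static void on_add_buf_success(tx_state_t *tx, const send_entry_t *e)
{
    tx->free_hw_buffers -= e->hw_buffers;
}

/* Called after get_buf returned the entry of a transmitted packet */
static void on_get_buf_success(tx_state_t *tx, const send_entry_t *e)
{
    tx->free_hw_buffers += e->hw_buffers;
}
```

Keeping the count in the send entry means the driver does not need to re-inspect the packet when its buffers come back.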
All the members of the adapter context structure related to the sending path are protected by the Send Lock. All functional calls to the Send Queue of the VirtIO library (get_buf, add_buf, kick) must be protected by the same lock object. Note that completion of TX packets must be executed without holding spinlocks in both NDIS5 and NDIS6 (completion while holding a spinlock may cause a deadlock, e.g. if NDIS immediately calls the send operation). See also Synchronization.
Related procedures:
# | File | Procedure | Responsibility |
1 | ParaNdis-Common.c | ParaNdis_DoSubmitPacket | Decide which mechanism to use for sending. Call 2 or 5 to prepare the data. Call add_buf to submit. Return final status of operation |
2 | ParaNdis-Common.c | ParaNdis_DoCopyPacketData | Obtain descriptor and buffer for data copy, call 6 to copy data |
3 | ParaNdis-Common.c | ParaNdis_VirtIONetReleaseTransmitBuffers | Call get_buf to retrieve data. Return available descriptor to the pool. Track number of available descriptors and buffers. Call 7 to do packet-dependent operations |
Implementation (NDIS version dependent) |
4 | ParaNdisX-Impl.c | ParaNdis_ProcessTx | Main body. Process packets list, call 1 to submit, finish processing depending on result |
5 | ParaNdisX-Impl.c | ParaNdis_PacketMapper | Process packet and create final list of SG elements. If required, replace it partially using mechanism in 6. Make all the preparations for LSO. |
6 | ParaNdisX-Impl.c | ParaNdis_PacketCopier | Process packet and copy data to provided buffer. If needed, populate priority tag. |
7 | ParaNdisX-Impl.c | ParaNdis_OnTransmitBufferReleased | Label packet as ready for further completion. |
An NDIS miniport is required to support statistics for send operations: the number of successfully/unsuccessfully sent bytes/packets per kind of packet (unicast, multicast and broadcast). This means that in any case the driver needs to preview the packet’s header. The same ParaNdis_PacketCopier procedure is used to copy only the Ethernet header out of the packet for preview.
Sending data in NDIS5 scheme
The driver receives an array of NDIS_PACKET structures, each of which describes a single packet to be sent. Each NDIS_PACKET contains a chain of NDIS_BUFFER structures describing the packet’s fragments in virtual memory. Each of these fragments may consist of more than one part in physical memory.
The driver traverses the chain of NDIS_BUFFER structures in ParaNdis_PacketCopier in order to access their data for copying.
The driver obtains per-packet information from the packet’s OOB block:
- SG list of fragments in physical memory (in ParaNdis_PacketMapper)
- Priority tagging requirements
- Offload requirements for LSO (TcpLargeSendPacketInfo)
- Offload requirements for checksum (TcpIpChecksumPacketInfo)
On entry to the “Send” procedure the driver allocates a Send Entry (tSendEntry structure) for each packet, and in further processing maintains the Send Queue and its WL as lists of Send Entries.
After packets are processed by ParaNdis_DoSubmitPacket, those that are not delayed are moved to the WL. Failed packets are labeled as if their transmit buffer had been released. The WL is processed on each exit from the main TX procedure: finished packets are completed and the attached resources (send entry) are freed.
Sending data in NDIS6 scheme
The driver receives a list of NET_BUFFER_LIST structures; each NET_BUFFER_LIST contains a chain of NET_BUFFER structures, where each NET_BUFFER represents a packet. Each NET_BUFFER contains a list of MDL structures, where each MDL describes fragments in virtual memory.
The driver shall report completion using a list of the same NET_BUFFER_LIST structures, although it can group them as it wants, keeping the data and the set of NET_BUFFER structures untouched. The status is also reported per NET_BUFFER_LIST, and the OOB data is bound to it as well. When sending packets, the driver needs to track their completion and complete the NET_BUFFER_LIST when the last packet from it is finished.
For this reason the driver allocates one structure (tNetBufferListEntry) per NET_BUFFER_LIST and one structure (tNetBufferEntry, starting with a list entry) per buffer. The Send Queue contains NET_BUFFER_LIST elements, where each tNetBufferListEntry includes the list of its tNetBufferEntry elements.
Unlike NDIS5, in NDIS6 the packet initially does not disclose its hardware addresses. To obtain them, the driver must initiate a mapping operation per packet, and it receives the SG list of the packet only in the callback procedure. The mapping request (NdisMAllocateNetBufferSGList) must be issued at DPC level, and the callback may be called synchronously or asynchronously. In the latter case the order in which the callback is called for different packets is not guaranteed; at least, it is not documented. Thus, the driver can hardly start processing a NET_BUFFER_LIST until all the packets from it are mapped.
The ParaNdis6_Send procedure receives the list of NET_BUFFER_LIST structures and:
- For each NET_BUFFER_LIST, allocates a tNetBufferListEntry and initializes all the list-wide parameters in it
- Writes the tNetBufferListEntry into the Scratch field of the NET_BUFFER_LIST
- For each NET_BUFFER in the NET_BUFFER_LIST, allocates a tNetBufferEntry
- tNetBufferListEntry contains a linked list of its tNetBufferEntry structures
- each tNetBufferEntry keeps a pointer to its NET_BUFFER_LIST
- for each tNetBufferEntry, requests mapping
- if mapping fails for one or more tNetBufferEntry structures, calls the callback directly with an empty set of addresses
Upon the callback with hardware addresses available, the driver:
- keeps the addresses in the tNetBufferEntry (they must be freed later)
- increments the number of mapped buffers in the parent tNetBufferListEntry
- when all the buffers are mapped
  - inserts the parent NET_BUFFER_LIST into the Send Queue
  - starts the main body of the TX operation (ParaNdis_ProcessTx)
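The counting step in the callback can be sketched as below. The structure and function are hypothetical simplifications of tNetBufferListEntry and the mapping callback; a real driver would also have to make the increment safe against concurrent asynchronous callbacks (for example with an interlocked operation or under a lock), which this single-threaded sketch does not show.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of a tNetBufferListEntry */
typedef struct {
    unsigned total_buffers;   /* NET_BUFFERs in the NET_BUFFER_LIST */
    unsigned mapped_buffers;  /* callbacks received so far          */
} nbl_entry_t;

/* Called from the mapping callback (or directly when mapping failed).
 * Returns true when the last buffer of the list has been mapped, i.e.
 * the caller shall insert the list into the Send Queue and start the
 * main body of the TX operation. */
static bool on_buffer_mapped(nbl_entry_t *e)
{
    e->mapped_buffers++;
    return e->mapped_buffers == e->total_buffers;
}
```

Since the callback order is not guaranteed, the decision depends only on the count, not on which buffer was mapped last.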
Inside the main body of the TX operation the driver peeks at the packet list at the head of the Send Queue, retrieves the next packet to send from it and tries to submit it.
If the packet is successfully submitted (as described in the common part), a tNetBufferListEntry that still has packets is processed again, and an empty tNetBufferListEntry is moved to the WL. If the packet fails, the packet completion procedure is called immediately.
When the packet is completed by VirtIO (or failed), the packet completion procedure labels it as finished, increments the number of finished packets in the tNetBufferListEntry and frees the resources associated with the specific tNetBufferEntry.
The main body of TX exits its loop when all the packets are submitted or when the next packet is delayed (no VirtIO buffers for it). Then the procedure checks its WL and completes each NET_BUFFER_LIST in it that has all its packets completed.
Canceling packets in process
NDIS may request cancellation of packets in process, providing a Cancel ID, which is saved in the OOB block.
For NDIS5, this is the ID of the packet, and it is simple to find the packet in the Send Queue and complete it.
For NDIS6, this is the ID of a NET_BUFFER_LIST, which may have some of its packets in process. So, if the NET_BUFFER_LIST has not been started yet, it is completed; if it is in process, it is not cancelled.
Checksum offload
The NetKVM driver includes an implementation of all the TX CS offloads for IPV4 (IP + options, TCP + options, UDP), and it works functionally. CS offload can work with SW emulation or can use host support (when the host declares in the host features bit mask that it is able to do it, by setting VIRTIO_NET_F_CSUM).
It is very helpful during development and diagnostics; some of these mechanisms are also used for LSO.
There are 3 problems:
- IP and TCP, or IP and UDP, cannot be enabled together when hardware CS is used. The VirtIO header is TCP checksum oriented. Possibly it can do IP CS, but not together with TCP.
- The implementation does not pass corner cases of the NDIS test: the test sends packets with an invalid IP checksum and an invalid TCP checksum, the OOB data requires the driver NOT to fix the CS, the test expects to receive identical packets on the other side, and the host does not deliver them.
- There is no performance improvement due to CS support.
The checksum offload is controlled by configuration and disabled by default. The implementation of the SW emulation is in sw-offload.c.
All the procedures work with a data block starting from the beginning of the IP header. First of all they parse the headers (at least IP; it is good to have the basic TCP header as well) to understand the kind of packet, and save the result of parsing in a tTcpIpPacketParsingResult structure (actually a 32-bit bit mask).
Then, when possible and requested, the procedures verify and/or calculate the required type of checksum. The exact required modification or verification of the checksum is set in a parameter bit mask passed to the procedure.
See the following procedures:
- ParaNdis_CheckSumCalculate (parses and calculates)
- ParaNdis_CheckSumVerify (parses, calculates and modifies or leaves as is)
- ParaNdis_ReviewIPPacket (parses only)
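For reference, the core of any software checksum calculation of this kind is the Internet checksum (one's complement sum of 16-bit words, as described in RFC 1071). The sketch below is a generic illustration with hypothetical function names; it is not the code from sw-offload.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds len bytes to a running 32-bit sum as big-endian 16-bit words */
static uint32_t checksum_add(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)(data[len - 1] << 8);
    return sum;
}

/* Folds the carries back in and returns the one's complement */
static uint16_t checksum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* IP header checksum; ip points to the first byte of the IP header.
 * To calculate, the checksum field (bytes 10..11) must be zero.
 * To verify, leave the field as received: the result is 0 if valid. */
static uint16_t ip_header_checksum(const uint8_t *ip)
{
    size_t header_len = (size_t)(ip[0] & 0x0F) * 4;  /* IHL in 32-bit words */
    return checksum_fold(checksum_add(0, ip, header_len));
}
```

The TCP and UDP checksums use the same sum, started from the pseudo header value and continued over the TCP or UDP header and payload.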
The driver always receives the offset of the IP header in the OID parameters when NDIS configures the miniport for an offload task. Currently we declare support for only one encapsulation (802.3), and the offset is always 14 bytes.
Note that the packet data must be in a contiguous buffer in virtual space in order to be processed. As the traversal of all the data blocks in the packet is already implemented in the ParaNdis_PacketCopier procedures, they are also used by any procedure that needs the starting part of the packet for preview.
Large TCP segments offload
The NetKVM driver implements LSO for IPV4 only. This feature depends on a host ability, indicated by VIRTIO_NET_F_HOST_TSO4. When NDIS sends packets with a request for LSO in the OOB, it does not set the TCP and IP checksums.
There is one issue with the LSO at the moment:
- If the driver does not fix the IP checksum in LSO packet, the host does not deliver packets
The driver sets the limit of the block size that can be submitted to 0xF000. During tests NDIS uses different packet sizes and sets the MSS to various values, from 536 bytes upward; sometimes the host generates a burst of more than 100 packets, so the test may fail if some other traffic is present on the tap.
In a functional environment, the system usually sets the MSS to the maximal possible value (1460 if IP and TCP are without options).
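The size of such a burst follows from simple arithmetic: each segment except the last carries MSS bytes of TCP data. A minimal sketch (the function name is hypothetical):

```c
/* Number of segments produced from tcp_payload_len bytes of TCP data:
 * the payload length divided by MSS, rounded up. */
static unsigned lso_segment_count(unsigned tcp_payload_len, unsigned mss)
{
    if (mss == 0)
        return 0;                       /* invalid request, nothing to send */
    return (tcp_payload_len + mss - 1) / mss;
}
```

For example, 60000 bytes of TCP data with an MSS of 536 are split into 112 segments, while the same data with an MSS of 1460 needs only 42.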
The LSO mechanism uses the same IP header offset value as checksum offload; it is set by NDIS when it enables any of the offload capabilities. Additionally, in NDIS6 the driver receives (for each NET_BUFFER_LIST) the offset of the TCP header in the outgoing packet.
When using LSO, the driver shall calculate the total length of all the transmitted packets and write it into the OOB (TcpLargeSendPacketInfo). It is not documented how exactly the driver shall calculate this value; the only criterion is that the NDIS tests pass. The driver calculates these numbers differently for NDIS5 and NDIS6.
Priority and VLAN tagging
Priority and VLAN information are combined in the same tag, and the Priority functionality is verified under the NDIS test (using a VirtIO adapter paired with another VirtIO adapter and also paired with an e1000 adapter).
In general, the VLAN to operate on shall be provided by the operating system in the configuration settings of the driver, and the driver needs to populate this value in the outgoing tag.
However, VLAN functionality does not have any test case in the NDIS test suite and is currently not verified.
Corner cases
Small packets
One of the requirements for a network adapter (mentioned in the Logo requirements) is padding outgoing packets shorter than 60 bytes with zeroes up to 60 bytes. Due to this requirement, the driver always uses copy for such short packets and does the padding (in ParaNdis_PacketCopier).
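The padding step can be sketched as follows. The function is a hypothetical simplification that copies from one contiguous source buffer; the real ParaNdis_PacketCopier walks the chain of packet fragments.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_MIN_PACKET_SIZE 60  /* minimal packet size required above */

/* Copies len bytes of packet data into dst and pads with zeroes up to
 * 60 bytes. Returns the number of bytes to transmit, or 0 if the
 * destination buffer is too small. */
static size_t copy_with_padding(uint8_t *dst, size_t dst_size,
                                const uint8_t *src, size_t len)
{
    size_t out = len < ETH_MIN_PACKET_SIZE ? ETH_MIN_PACKET_SIZE : len;

    if (out > dst_size)
        return 0;
    memcpy(dst, src, len);
    if (out > len)
        memset(dst + len, 0, out - len);  /* zero padding */
    return out;
}
```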
Spurious data in packets
The driver must be careful when parsing the chain of fragments related to each packet provided for transmission. Each packet has a declared data length; the buffer chain may continue past this valid area, and the buffers that follow are not valid. The driver must ignore them and never touch them.
In NDIS6 an unused area may also be present before the used data area. This unused area is present in the packet’s SG list received by the driver, so the driver must check where the used data starts and take it into account when preparing the packet for submission.
Effect of mergeable buffers
Support for mergeable RX buffers changes the format of the VirtIO header used when the driver communicates with the VirtIO network device. This change also affects the TX path; during initialization the driver allocates a bigger area for the VirtIO header.
Receiving data
Initialization of receive path
During initialization the driver creates VirtIO queue which is automatically sized according to size of the queue in the device. The driver must provide physical buffers to place the received data as well as the structure for tracking buffers (descriptors).
In order to be able to communicate with NDIS core and provide received packets, the driver must have a pool of NDIS objects for such indication.
- For NDIS5 the driver indicates array of NDIS_PACKET structures, each one may have one or more NDIS_BUFFER structures attached
- For NDIS6 the driver indicate one or more NET_BUFFER_LIST structures; each one has single NET_BUFFER structure attached (may have more than one); each NET_BUFFER contains one MDL describing buffer in virtual memory (may contain more than one MDL)
Thus, for any packet the driver receives, it needs to have contiguous buffer with known physical and virtual addresses and
- For NDIS5 – NDIS_PACKET with attached NDIS_BUFFER describing physical buffer
- For NDIS6 – NET_BUFFER_LIST structure with attached NET_BUFFER and MDL describing the physical buffer
The driver uses configuration parameter of initial number of RX buffers to prepare. Using this number it preallocates these system-dependent objects (one per buffer) and pool objects for them and uses them when required.
Then the driver starts preparing descriptor structures (listable control blocks) for receiving. The driver maintains list of descriptors that are waiting for data in VirtIO queue (NetReceiveBuffers) and list of descriptors that were indicated to NDIS and must be returned to the driver (NetReceiveBuffersWaiting). The driver prepares descriptors one-by-one:
- Allocates descriptor for RX buffer
- Allocates physical buffer for VirtIO header (10 bytes)
- Allocates physical buffer for packet data (1514 bytes + 4 for possible priority tag)
- Saves these buffers (physical address, virtual address and size) in the descriptor
- Calls system-dependent implementation of ParaNdis_BindBufferToPacket to allocate required system-dependent object (MDL for NDIS6, NDIS_BUFFER for NDIS5) to describe the buffer for packet data
- Uses add_buf to pass these 2 buffers (header and data) to VirtIO for further receiving
- Adds the descriptor to the list of buffers that VirtIO currently owns (NetReceiveBuffers)
- When VirtIO cannot accept more buffers, the driver stops the loop; it works with as many buffers as VirtIO supports
There is a small change in initialization when mergeable RX buffers are supported:
- A separate physical buffer for the VirtIO header is not used
- The physical buffer for packet data is bigger (1514 bytes + 4 for possible priority tag + 12 bytes header)
- Uses add_buf to pass 1 buffer (header with data) to VirtIO for further receiving
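The buffer sizes in both modes follow directly from the numbers above. The sketch below only does this arithmetic; the type RxBufferLayout and the function rx_buffer_layout are hypothetical names, not driver code.

```c
#include <assert.h>
#include <stddef.h>

enum {
    MAX_FRAME_SIZE       = 1514, /* packet data size from the text */
    PRIORITY_TAG_SIZE    = 4,    /* room for a possible priority tag */
    VIRTIO_HDR_BASIC     = 10,   /* VirtIO header, separate buffer */
    VIRTIO_HDR_MERGEABLE = 12    /* VirtIO header with mergeable buffers */
};

/* Hypothetical description of what is passed to VirtIO per descriptor. */
typedef struct {
    size_t header_size;  /* 0 when there is no separate header buffer */
    size_t data_size;    /* size of the packet data buffer */
    unsigned sg_entries; /* number of buffers passed in one add_buf call */
} RxBufferLayout;

static void rx_buffer_layout(int mergeable, RxBufferLayout *layout)
{
    assert(layout != NULL);
    if (mergeable) {
        /* header and data share one physical buffer */
        layout->header_size = 0;
        layout->data_size = VIRTIO_HDR_MERGEABLE + MAX_FRAME_SIZE + PRIORITY_TAG_SIZE;
        layout->sg_entries = 1;
    } else {
        layout->header_size = VIRTIO_HDR_BASIC;
        layout->data_size = MAX_FRAME_SIZE + PRIORITY_TAG_SIZE;
        layout->sg_entries = 2;
    }
}
```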
The driver can indicate packets one by one (as arrays of 1 element or as list of one NET_BUFFER_LIST element), but batch indication (as array of packets or list of many elements) significantly improves the performance. For that, during initialization the driver allocates storage for array of pointers to use it during batch indication.
Receiving
The main receive procedure, ParaNdis_ProcessRxPath, retrieves all available buffers from the VirtIO RX queue and starts by checking the packet type; the virtual address of the buffer was saved in the descriptor during initialization. It checks the destination address of the packet and decides, using the current packet filter mask, whether the packet shall be indicated up to the network stack or dropped. For a packet to be indicated it calls the system-dependent implementation of ParaNdis_IndicateReceivedPacket, which can indicate the packet or just prepare it for later indication in a batch of packets. ParaNdis_IndicateReceivedPacket allocates the required objects from the prepared pools, retrieves from the descriptor the already prepared object describing the buffer, and returns the system-dependent object for indication (NDIS_PACKET or NET_BUFFER_LIST). When the batch is full, the system-dependent procedure ParaNdis_IndicateReceivedBatch finally does the indication. The batch size is defined during initialization. It can include all the buffers of VirtIO, but during the loop the procedure may, in general, retrieve more buffers than VirtIO contains, as some buffers may be returned by NDIS, put back to VirtIO and returned to the driver again.
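The filtering decision can be sketched as below. This is an illustration only: the FILTER_* constants are local names whose values are assumed to mirror the NDIS_PACKET_TYPE_* bits, and FilterContext and should_indicate are hypothetical and do not appear in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Local names; values assumed to mirror the NDIS_PACKET_TYPE_* bits. */
enum {
    FILTER_DIRECTED      = 0x01,
    FILTER_MULTICAST     = 0x02,
    FILTER_ALL_MULTICAST = 0x04,
    FILTER_BROADCAST     = 0x08,
    FILTER_PROMISCUOUS   = 0x20
};

/* Hypothetical subset of the adapter context used for filtering. */
typedef struct {
    unsigned filter;                             /* current filter mask */
    unsigned char mac[ETH_ALEN];                 /* current MAC address */
    const unsigned char (*mcast_list)[ETH_ALEN]; /* multicast addresses */
    size_t mcast_count;
} FilterContext;

/* Returns 1 if the frame shall be indicated, 0 if it shall be dropped.
 * The frame starts with the 6-byte destination address. */
static int should_indicate(const FilterContext *ctx, const unsigned char *frame)
{
    static const unsigned char bcast[ETH_ALEN] =
        { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
    size_t i;

    assert(ctx != NULL && frame != NULL);
    if (ctx->filter & FILTER_PROMISCUOUS)
        return 1;
    if (memcmp(frame, bcast, ETH_ALEN) == 0)
        return (ctx->filter & FILTER_BROADCAST) != 0;
    if (frame[0] & 0x01) { /* group bit set: multicast */
        if (ctx->filter & FILTER_ALL_MULTICAST)
            return 1;
        if (!(ctx->filter & FILTER_MULTICAST))
            return 0;
        for (i = 0; i < ctx->mcast_count; i++)
            if (memcmp(frame, ctx->mcast_list[i], ETH_ALEN) == 0)
                return 1;
        return 0;
    }
    return (ctx->filter & FILTER_DIRECTED) &&
           memcmp(frame, ctx->mac, ETH_ALEN) == 0;
}
```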
The procedure of indication must be called without holding spinlock, as the returning of buffers from NDIS to the driver may happen synchronously or asynchronously.
When indicating packets, the driver keeps in NDIS_PACKET or NET_BUFFER_LIST the address of the buffer’s descriptor for easy post-processing upon returning.
An additional responsibility of ParaNdis_IndicateReceivedPacket is removing the priority tag if it is present in the incoming packet and populating priority/VLAN data in the packet’s OOB. Currently this procedure is not optimal: it removes the tag by moving data inside the buffer; it can be optimized later.
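A minimal sketch of tag removal by moving data is shown below. It assumes a standard 802.1Q tag (type 0x8100 at offset 12) and moves the payload back over the tag; the function name strip_priority_tag is hypothetical, and the real procedure may move the data differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Removes an 802.1Q tag from an Ethernet frame in place, if present.
 * On return *priority and *vlan_id hold the values for the packet's OOB
 * (both 0 when there is no tag). Returns the new frame length. */
static size_t strip_priority_tag(unsigned char *frame, size_t len,
                                 unsigned *priority, unsigned *vlan_id)
{
    unsigned tci;

    assert(frame != NULL && priority != NULL && vlan_id != NULL);
    *priority = 0;
    *vlan_id = 0;
    /* 12 bytes of addresses + 4 bytes of tag + 2 bytes of type */
    if (len < 18 || frame[12] != 0x81 || frame[13] != 0x00)
        return len;
    tci = ((unsigned)frame[14] << 8) | frame[15];
    *priority = tci >> 13;   /* upper 3 bits */
    *vlan_id = tci & 0x0FFF; /* lower 12 bits */
    /* move the rest of the frame back over the 4-byte tag */
    memmove(frame + 12, frame + 16, len - 16);
    return len - 4;
}
```

Moving the 12 address bytes forward instead would copy less data, but it changes the start of the packet inside the buffer.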
Packets, filtered out by filter mask, are returned to VirtIO immediately.
All the variables and objects related to RX path are protected by Receive Lock. All the operations with VirtIO RX queue are protected the same way from simultaneous add_buf, get_buf, kick, restart operations.
Returning buffers
NDIS returns packets one-by-one in NDIS5 or as a list of NET_BUFFER_LIST structures in NDIS6; in both cases all the objects are reinitialized and returned to their pools, and the descriptor, ready for reuse, is added to the VirtIO queue using the common procedure ParaNdis_VirtIONetReuseRecvBuffer (also protected by Receive Lock).
Pausing of operation
NDIS6 defines required miniport procedures for pausing and resuming the adapter’s operation. NDIS pauses the adapter before binding or unbinding protocol drivers to/from the driver’s stack and also when it initializes the adapter and shuts it down. The pause procedure must stop send and receive operations, return to NDIS all the buffers the miniport holds in the TX path, and get back from NDIS all the miniport buffers that were indicated.
Although NDIS5 does not require such operations, the driver uses them for both NDIS5 and NDIS6 for consistent implementation of transitions during initialization, halting, reset, power off and power on.
Both TX and RX paths maintain a state variable of tri-state type tSendReceiveState: Enabled, Disabled or Pausing. The transition from Disabled to Enabled is simple and synchronous; the transition from Enabled to Disabled is asynchronous, with a possible intermediate Pausing state in which the RX path waits for the return of all packets indicated to NDIS and the TX path waits for completion of all packets whose sending is in progress.
When asynchronous pausing starts, sending and receiving of new packets are suppressed, and callback procedures are set to signal the end of the transition and provide indication to NDIS when applicable.
The check whether the transition to the paused state has finished is triggered:
- (RX path) when RX buffers are reused and passed to VirtIO (when there are no more waiting buffers, the suspend of the RX path is done)
- (TX path) in the main TX procedure, called from the interrupt DPC when VirtIO indicates that sending of some submitted buffers has finished.
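The state transitions described above can be modeled with a small sketch. Only the three states come from the text; the PathState structure, the function names and the counter of outstanding packets are hypothetical, and locking is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STATE_DISABLED, STATE_ENABLED, STATE_PAUSING } SendReceiveState;
typedef void (*PauseDoneCallback)(void *context);

/* Hypothetical per-path (TX or RX) state. */
typedef struct {
    SendReceiveState state;
    unsigned outstanding;        /* packets owned by NDIS (RX) or in flight (TX) */
    PauseDoneCallback on_paused; /* called when an asynchronous pause ends */
    void *callback_context;
} PathState;

/* Disabled -> Enabled: simple and synchronous. */
static void path_enable(PathState *p) { p->state = STATE_ENABLED; }

/* New packets are accepted only in the Enabled state. */
static int path_may_start_packet(const PathState *p)
{
    return p->state == STATE_ENABLED;
}

/* Enabled -> Disabled. Returns 1 if finished synchronously,
 * 0 if the path entered Pausing and the callback will be called later. */
static int path_pause(PathState *p, PauseDoneCallback cb, void *context)
{
    if (p->outstanding == 0) {
        p->state = STATE_DISABLED;
        return 1;
    }
    p->state = STATE_PAUSING;
    p->on_paused = cb;
    p->callback_context = context;
    return 0;
}

/* Trigger point: an RX buffer was returned or a TX packet was completed. */
static void path_packet_done(PathState *p)
{
    assert(p->outstanding > 0);
    p->outstanding--;
    if (p->state == STATE_PAUSING && p->outstanding == 0) {
        PauseDoneCallback cb = p->on_paused;
        p->state = STATE_DISABLED;
        p->on_paused = NULL;
        if (cb != NULL)
            cb(p->callback_context);
    }
}

/* Example callback: counts completed pause transitions. */
static void count_pause_done(void *context) { ++*(int *)context; }
```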
Reset operation implemented the same way – “reset” procedure pauses RX and TX path, then does “reset” of internal state and device and restarts RX and TX.
Power off and power on procedures also include pause (suspend) and resume as steps of execution.
Connect detection
Connect detection starts from the device implementation in QEMU, which maintains the connection state and generates state change events when the state is changed from the monitor’s command line. The driver recognizes these events when it reads the interrupt status and passes the status in a context variable to the DPC procedure.
When the connect detection bit (bit 1) is set in the interrupt status, the driver in the DPC rereads the connect status from the configuration array of the device (the configuration array contains the MAC address and the connect status) via the IO space of the PCI device and sends an indication to NDIS about the change in connection state. The exact implementation of the connect / disconnect indication depends on the NDIS version and uses the system-dependent procedure ParaNdis_IndicateConnect.
When the connect status in the miniport is “not connected”, the procedure responsible for sending packets immediately completes any new packet with failure status; packets already queued for sending are completed as usual. Processing of interrupts and DPC continues; any packet received in the not connected state is returned to the VirtIO queue without indication.
- To set the link status through the QEMU monitor, type set_link <nic>.<instance number> <up | down>
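The DPC-side check of the interrupt status can be sketched as follows. The structures and the function handle_interrupt_status are hypothetical; the sketch assumes that bit 0 of the status field means "link up" and reports only actual changes of state.

```c
#include <assert.h>
#include <stddef.h>

#define ISR_QUEUE_BIT      0x01u /* bit 0: queue activity */
#define ISR_CONFIG_BIT     0x02u /* bit 1: connect detection */
#define NET_STATUS_LINK_UP 0x01u /* assumed meaning of status bit 0 */

/* Hypothetical image of the device configuration array. */
typedef struct {
    unsigned char mac[6];
    unsigned short status;
} NetConfig;

/* Hypothetical link state kept in the adapter context. */
typedef struct {
    int connected;        /* last state reported to NDIS */
    unsigned indications; /* number of state change indications sent */
} LinkState;

/* Called in the DPC with the interrupt status saved by the interrupt
 * procedure and with the freshly reread device configuration. */
static void handle_interrupt_status(LinkState *link, unsigned isr,
                                    const NetConfig *cfg)
{
    int now;

    assert(link != NULL && cfg != NULL);
    if (!(isr & ISR_CONFIG_BIT))
        return; /* no configuration change reported */
    now = (cfg->status & NET_STATUS_LINK_UP) != 0;
    if (now != link->connected) {
        link->connected = now;
        link->indications++; /* stands in for ParaNdis_IndicateConnect */
    }
}
```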
OID support
The implementation of OID support in NDIS5 and NDIS6 is slightly different, although the approach is the same and there are many OID with nearly identical implementation in both systems.
NDIS5 defines Get and Set methods for OID; NDIS6 defines many request types, but only Get and Set are actually in use (the Get Statistics method is effectively a Get, the others are optional).
For both, on each OID call the driver receives an exchange buffer and its input and output size; on successful completion the driver needs to indicate the number of bytes it has read or written; on failure due to a too small buffer the driver indicates the buffer size required for the operation.
This allows splitting of OID support implementation to system-dependent and system-independent part where entry points are system dependent and the implementation for many OID is common.
An OID call executes at an IRQL that may be PASSIVE or DISPATCH, so in most cases the driver must either complete the OID immediately or fail it immediately. In other, rare cases the driver completes the OID asynchronously using scheduled work items or special indications.
Each system-dependent implementation contains table of structures defining all the supported OID operations, each table entry includes
- OID code (32-bit value)
- Bit mask describing operations valid for this OID (get, set, get via statistics, conversion of 64-bit to 32-bit when applicable)
- Procedure to call when Set operation requested
- Logging level upon entry, completion or error
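A table of this kind and its lookup can be sketched as below. The OidEntry layout, the OID_OP_* flags and the stub handler are hypothetical; the two OID codes are assumed to be the values of OID_GEN_LINK_SPEED and OID_GEN_CURRENT_PACKET_FILTER.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for the operations valid for an OID. */
enum { OID_OP_GET = 0x1, OID_OP_SET = 0x2, OID_OP_GET_STATS = 0x4 };

typedef int (*OidSetHandler)(const void *buffer, size_t length);

/* Hypothetical table entry following the list above. */
typedef struct {
    uint32_t oid;              /* OID code */
    unsigned ops;              /* bit mask of valid operations */
    OidSetHandler set_handler; /* called on Set, NULL if not settable */
    int log_level;             /* logging level for this OID */
} OidEntry;

/* Stub Set handler used only by the sample table. */
static int set_packet_filter_stub(const void *buffer, size_t length)
{
    (void)buffer;
    return length == sizeof(uint32_t) ? 0 : -1;
}

static const OidEntry sample_table[] = {
    { 0x00010107u, OID_OP_GET,              NULL,                   1 },
    { 0x0001010Eu, OID_OP_GET | OID_OP_SET, set_packet_filter_stub, 1 }
};
#define SAMPLE_COUNT (sizeof sample_table / sizeof sample_table[0])

/* Linear search; returns NULL for an unsupported OID. */
static const OidEntry *find_oid(const OidEntry *table, size_t count, uint32_t oid)
{
    size_t i;

    assert(table != NULL);
    for (i = 0; i < count; i++)
        if (table[i].oid == oid)
            return &table[i];
    return NULL;
}

/* Checks whether the requested operation is valid for the entry. */
static int oid_allows(const OidEntry *entry, unsigned op)
{
    return entry != NULL && (entry->ops & op) != 0;
}
```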
The registered procedure first retrieves from the table the entry for the received OID, initializes a temporary descriptor for the request in system-independent format (containing the input/output buffer, its sizes etc.) and passes the request to the suitable procedure:
- For Set request – to “Set” processing routine specified in the OID table
- For Get (Query) request – to the system-dependent procedure, which processes OID codes supported only on the specific system; in the default case it calls the common procedure for system-independent processing of the remaining OIDs
Starting operation for “Set” request and final operation for “Get” request, including checking of buffer sizes and transfer of the data are implemented in system-independent procedures (ParaNdis_OidSetCopy, ParaNdis_OidQueryCopy) to avoid multiple checks in each OID specific procedure.
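The final step of a Get request, with its single buffer size check, can be sketched as follows. The OidRequest structure and the function oid_query_copy are hypothetical and do not reproduce the signature of ParaNdis_OidQueryCopy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { OID_OK = 0, OID_BUFFER_TOO_SHORT = 1 } OidStatus;

/* Hypothetical system-independent request descriptor. */
typedef struct {
    void *buffer;         /* exchange buffer provided by NDIS */
    size_t buffer_length; /* its size in bytes */
    size_t bytes_written; /* reported on success */
    size_t bytes_needed;  /* reported when the buffer is too small */
} OidRequest;

/* Copies the result of a query into the exchange buffer, or reports
 * the required buffer size when the buffer is too small. */
static OidStatus oid_query_copy(OidRequest *req, const void *src, size_t len)
{
    assert(req != NULL && src != NULL);
    if (req->buffer_length < len) {
        req->bytes_written = 0;
        req->bytes_needed = len;
        return OID_BUFFER_TOO_SHORT;
    }
    memcpy(req->buffer, src, len);
    req->bytes_written = len;
    req->bytes_needed = 0;
    return OID_OK;
}
```

With such a helper each OID-specific procedure only prepares the value to return and does not repeat the size checks.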
General-purpose OID
Many OID requests are directly related to fields of device context and actually copy data to/from buffers with exact buffer size validation and minimal processing, as:
- Connection speed and connection status
- MTU size
- Estimated free buffer space
- Packet filter bit mask
- MAC addresses, current and permanent
- List of multicast addresses
- List of Wakeup patterns
- Data transfer statistics
Offload OID
OID requests related to offload control are different in NDIS5 and NDIS6 and have little in common.
NDIS5
After device instance initialization NDIS queries the offload capabilities of the driver/device, providing a buffer that the driver fills with a concatenation of NDIS_TASK_OFFLOAD structures, each of which describes the capability of the device to support a specific kind of task offload (checksum, LSO, security etc). In the “Set” request for the offload OID, NDIS passes to the driver a list of NDIS_TASK_OFFLOAD structures describing the configuration of offload features that the driver shall enable. This configuration always contains the encapsulation setting (one of those the driver supports) and the placement of the IP header in the data buffers of packets to be sent (non-IP packets are of no interest, offload is never required for them).
Empty set of tasks to offload effectively disables offload capabilities for the driver.
Using configuration parameters, some or all offload tasks can be masked from reports and left disabled.
NDIS6
Initial set of possible kinds of offload tasks (capabilities) and initial set of enabled offload tasks (configuration) the driver retrieves from configuration parameters and reports to NDIS during its initialization sequence (via NdisMSetMiniportAttributes).
NDIS sends to the driver mandatory offload parameters (encapsulation, IP header offset) in a dedicated OID call and, if needed, changes the configuration of specific offload tasks in following OID requests, passing an NDIS_OFFLOAD structure. For historical reasons, the driver is required to support different versions of the offload task formats (V1 and V2), where V2 also includes IPV6 related fields.
When there is some change in offload configuration, caused by “Set” request sent to the driver, it reports its new offload configuration via status indication mechanism, passing NDIS_OFFLOAD structure filled with values, reflecting changes of the current configuration. Exact meaning of field values in NDIS_OFFLOAD structure varies depending on exact flow where the structure is used – in NDIS-to-driver “Set” request or in driver-to-NDIS status indication.
Statistics OID
In NDIS5 each statistics entry implemented using its own OID; in NDIS6 the miniport driver can return also single structure (NDIS_STATISTICS_INFO) including all the statistics; for simplicity, NDIS5 implementation of the driver keeps its statistics entry in the same structure in which NDIS6 implementation returns statistics information.
Power management OID
Power management starts from Set OID request for OID_PNP_SET_POWER, providing target device state (D0 or D3, these states implied, other must be declared in PCI properties, if supported).
These requests can never be completed synchronously, so the driver in both NDIS5 and NDIS6 returns STATUS_PENDING and uses system-specific ways to schedule a work item which is executed at PASSIVE IRQL. From the callback procedure of the work item the driver calls the common implementation of the power transition, ParaNdis_PowerOff or ParaNdis_PowerOn, and then indicates that the OID processing is completed.
Note that a system power transition (standby or hibernation) is blocked until the driver completes OID_PNP_SET_POWER (or the system waits a long time for completion and produces a BSOD on timeout). This OID for D3 cannot be failed (in the best case a failure aborts the system transition to the low power state); in general, the driver shall never fail power management OID requests.
Power management implementation
In general, the state of hardware devices in system hibernation state (S4) is very close to power-off. During the transition to S0 system state the VirtIO device will be fully reinitialized. During system transition to sleep (S3) state some device context may persist. From the driver’s point of view there is no difference between system transition to S4 and S3, in both cases the device goes to D3.
During initialization the NDIS driver can declare legacy behavior and ask for halt on transition to a low power device state by omitting the NDIS_MINIPORT_ATTRIBUTES_NO_HALT_ON_SUSPEND flag (setting the flag declares support for power management); the D0 to D3 transition then follows the same flow as if the device was disabled. Unfortunately, testing software does not accept such behavior, so the driver must simulate extended support for power management.
Declaring support for power management, the driver receives SET_POWER OID requests on system transitions and in both cases of system state transition (S3 and S4), it keeps its internal state, including buffers allocated in VirtIO library and physical buffers allocated during initialization.
Complete re-initialization on the power-on transition is not a good choice, because (especially in NDIS6) the driver is required to allocate physical memory during initialization and not during a power transition; a failure to allocate physical memory on the power-on transition would be unacceptable.
In order to prevent misunderstanding, the driver on each power-off device transition (D0 to D3) resets the device (by writing 0 to status register) and on power-on transition (D3 to D0) initializes it from scratch using the same set of already allocated buffers in VirtIO library and the same set of physical buffers allocated during initialization. For that, in the implementation of VirtIO library
- the shutdown method of the VirtIO queue is fixed to bring the queue to its initial state
- added high level method to PCI device API to renew the queue using existing buffers and just rewrite required addresses to registers of VirtIO device
Finally, the power-off sequence starts from Pause operation (suspend RX and TX and wait until all the blocks are returned to their owners). Then the driver shuts down both VirtIO queues, completely stopping the transfer, and resets the device by writing 0 to status register.
On power-on sequence the driver renews VirtIO queues and restarts send and receive paths.
The driver implements formal support for wake-up capabilities via wake-up patterns and/or magic packets; currently there is no support for this feature in QEMU and the goal of this implementation is only formal compatibility with requirements of WHQL testing procedures.
Reset command
For consistency, the driver implements the reset command using the same flow that it uses for device power-off and power-on during system power transitions.
The trigger for reset is NDIS call to miniport reset procedure, registered by the driver. NDIS can call it when it wants (especially during tests), but in normal usage there are two reasons to receive reset command:
- Very long delay in completion of OID request
- Two positive responses to miniport “Check for hang” procedure
The driver never causes either of these conditions, so the reset command is mainly initiated by test cases (NDIS tests). Upon the reset command, the driver acts as if it received the sequence “power-off, power-on”.
Halt command (adapter’s shutdown)
Upon Halt command, the driver frees all the resources, allocated for device instance and leaves the device in inactive state, not able to generate interrupts and ready for next initialization.
The device reset in the Halt flow is implemented in the same way as the device reset in the other flows (power-off and reset command).
Note that device reset sequence, to clean up device QEMU context, includes also writing 0 to register of guest features.
To initiate Halt command, disable the device in the device manager manually or do it using “devcon” tool from WDK.
Synchronization mechanisms
Synchronization goals
Under different flows, processing requests from NDIS as well as events from device, the driver must protect fields of its context from simultaneous modification from different contexts. Guidelines are simple
- Flow running on IRQL= PASSIVE may be preempted by DPC even on the same processor
- Flow running on IRQL= DISPATCH may be preempted by interrupt even on the same processor
- Flow running on any IRQL may run at the same time and conflict with any flow on different processor
The source code of the VirtIO library has the same potential hazards as the source code of the driver itself; the VirtIO library does not include any synchronization mechanisms, and the caller of any procedure is responsible for synchronization.
Each method of VirtIO queue, callable from external code, modifies internal state of this queue; so each access to VirtIO queue must be protected.
Synchronization objects
The driver uses modest set of synchronization means
- DPC spin lock acquired on DPC entry and released on DPC exit
- SendLock spinlock acquired when accessing variables related to TX path
- ReceiveLock spinlock acquired when accessing variables related to RX path
The driver also uses interlocked access to specific variables to provide synchronization of their modification regardless context where they are modified and interrupt synchronization to protect very limited set of variables from simultaneous modification from interrupt procedure and other contexts.
An additional point of synchronization (more precisely, IRQL forcing) is the request for TX packet mapping when NDIS6 sends packets to the miniport. NdisMAllocateNetBufferSGList must be called at IRQL=DISPATCH, but we cannot achieve this by acquiring SendLock as we do for all the TX variables, because NDIS can immediately call the mapping callback, causing a deadlock. To force IRQL=DISPATCH we initialize a temporary spinlock on the stack, acquire it before calling NDIS for buffer mapping, and release it immediately afterwards. Even if NDIS calls us back from any NDIS call initiated by the TX path, this does not cause a deadlock: the nested call acquires another temporary spinlock.
SendLock and ReceiveLock may actually use the same spinlock object, depending on the “UNIFY_LOCKS” define, which can be changed at compilation; no side effects are expected, as the TX path and RX path do not have any common variables.
NDIS requirements
According to the documentation, some NDIS calls may not be made with spinlocks held. The only reason for that could be the possibility that from this call NDIS calls some procedure of the driver, which may cause a deadlock: an attempt to acquire the same spinlock twice on the same processor, or an attempt to acquire two spinlocks on two different processors in different order.
The calls are
- Indication of received packet to NDIS
- Completion of send packets
- Status indications
Strictly speaking, we violate this requirement.
In driver’s DPC procedure, implementing deferred processing of interrupt raised by device, we acquire DPC spinlock and hold it until DPC is finished. We do not expect any side-effects from it, as this lock is not a part of any flow initiated by NDIS during packets sending or receiving.
This DPC lock serves two goals:
- Ensure in Halt or procedure that there is no DPC in process and any other DPC will not be processed when we free buffers and descriptors. It is possible to write this code without using
- Ensure that while we are in DPC processing, another DPC on another CPU does not run. With traditional hardware this is simple: device interrupts are disabled in the IRQ handler and enabled in the DPC. The VirtIO device clears the interrupt status on a read operation and immediately after that may raise another IRQ, possibly directed to another processor.
Build
Compilation and linking
The build process uses the “build” utility from WDK. It uses the “dirs” file in the current directory and in each directory listed in the “dirs” file to define the tree of directories to process; in each directory build uses the “sources” file to define sources and targets in it; the syntax of “sources” is similar to that of a make file for the “nmake” utility (although the “sources” file is not a make file; the real make file is present in each directory and only includes the WDK make file).
A batch file “buildAll” sets up the environment for each target OS, creates make files and “dirs” files and runs “build” utility. When driver’s binary file generated, the batch calls procedure for signing and packaging.
Signing and packaging
Providing a driver package (CAT file) signed at least with a development-time certificate is the only way to achieve smooth driver installation (for a completely smooth installation the driver must be signed with a logo certificate). For that, the build process includes these steps: it generates the CAT file from the INF file (using the INF2CAT utility from the latest DDK 6001 or the signability utility from an earlier DDK) and signs it using a development certificate generated especially for this project. The only goal is to make the build process complete and produce a ready-to-use package for immediate testing.
In order to sign the package with different certificate, the certificate file (and password) can be changed in build environment.
The signed package also can be signed again with different certificate (previous signature will be removed).
In addition to signing, in Vista OS family the operating system must be configured to allow installation of drivers with test signature (see below).
Creating an ISO image with the driver installation is a good practice for testing: under any operating system, drivers from a CD drive can always be installed automatically or using a simple installer, which may not work when the driver package is kept on the hard disk.
Installation
During testing the developer often tends to replace the kernel driver binary with a freshly compiled version. Overwriting only the kernel driver binary does not work on the Vista 64 OS family; the driver must be installed. Additionally, some tests from the Logo kit may reinstall the driver and return it to the originally installed version. Having the driver installed is always better than patching it in place.
Testing
Minimal subset of testing tools:
- Network performance tools
- IPERF, for short transfers use -n count, for specific NIC use -B
- Driver Verifier (verifier.exe), set it to at least netkvm.sys + ndis.sys (activating verifier on the driver significantly reduces the performance)
- NDIS test from WLK kit
- NDISTEST (a.k.a. 6.0)
- NDISTEST.NET (a.k.a. 6.5)
- WDK tools
- Devcon
- Plug-and-Play test
- Power test (for Vista)
- Sleeper (for automatic power management under XP)
Both offline NDIS (6.0 and 6.5) tests define DUT, support adapter and optional message adapter. Message adapter is mandatory for client-server test, when 2 VM used: one as server with one or two support adapters (usually VirtIO) and one message adapter (Intel); another one is client with one VirtIO under test, optional VirtIO as support adapter and Intel as message adapter.
Some of NDIS tests require only one adapter (DUT) – they do not require second VM at all; some require two adapters (DUT and local or remote support adapter).
NDIS test on single machine
Preliminary sanity test could be done on single VM in client mode with 2 NETKVM adapters, one is DUT and another one is support (without message adapter). Almost all the functional tests can run without problems in this configuration and detect problems in driver flows.
Recommended configuration for single VM tests includes 3 adapters: 2 NETKVM + one Intel E1000 with updated drivers from Intel support site.
It is recommended to run tests related to sending and receiving packets with priority tags twice, selecting NetKVM and Intel as support adapter.
NDIS test on two machines
NDIS test runs on one VM in server mode (requires 2 or 3 adapters – one as message device and another one or two as support devices) and on another VM in client mode (requires at least 2 adapters – one DUT and one as message device).
This setting is recommended for tests involving power management and also for MPE tests (multi-path execution).
Note that the MPE test may produce failures and BSOD when the machine under test has a small amount of RAM (at least 640 MB recommended); other tests do not use much memory and are able to run with 256-384 MB.
Debugging
Logging
The NetKVM driver implements simple scheme of logging and the main tool for collecting logging information is Debug View.
In order to avoid limitations of debugging output in Vista, instead of DbgPrint the driver uses vDbgPrintEx with a mask that is enabled by default and not filtered.
The debug level is controlled from adapter parameters (Device manager – Properties - Advanced), but it changes a global variable in the driver and affects all the devices supported by the driver. Typically, with the default debug level, the driver produces logging information only during initialization and shutdown.
WPP support
If ENABLE_EVENT_TRACING is defined in Common\common.mak, the logging mechanism changes to WPP. To collect logging information, WDK utilities (such as TraceView) shall be used, and a PDB file (or a TMF file containing WPP-specific information) is required to translate the unreadable log into a readable one. The TMF file can be produced from the PDB file using the tracefmt utility from WDK. Activating verbose logging this way is much more cumbersome than with Debug View.
Analysis of crash dump
In order to debug the driver when it (or system) stops responding, it is easy to use manual generation of crash dump using keyboard (the system must be configured to allow it, see [1] ).
Parsing the crash dump using WinDbg, in addition to regular commands use NDIS extension commands:
- “!ndiskd.miniports” gives a list of existing adapters. Use miniport address for details about specific adapter.
- “!ndiskd.miniport <address of miniport>” gives details for any specific miniport.
After adapter context retrieved using the last command, use “Watch” of this value with casting to (PARANDIS_ADAPTER *) to examine internal fields of per-adapter context (PDB file required).
Workarounds
Interrupt recovery
In earlier versions of KVM with Vista there was a problem of interrupt activation from host to guest. The code in Vista driver still contains parameter “Interrupt recovery”, enabled by default. It starts 15 ms timer which acts instead of hardware interrupt. When it detects a real interrupt activity, it stops running. With this workaround, the VM had fewer cases without interrupt streaming; and even if the guest VM does not receive interrupts, the miniport still can function, receives IP address etc.
Timed connect indication
In earlier versions of KVM/QEMU the connect/disconnect indication from the host to the VM was not implemented. During testing and development it was useful to delay the connect indication instead of issuing it immediately upon driver startup. This workaround is controlled by the “Connect after” parameter (now disabled by default).