## @file
#
# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# This program and the accompanying materials are licensed and made available
# under the terms and conditions of the BSD License which accompanies this
# distribution. The full text of the license may be found at
# http://opensource.org/licenses/bsd-license.php
#
# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT
# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
#
##

Disclaimer
----------

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents have been perused while writing the driver and this
document:
- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C;
  June 27, 2012;
- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.


Summary
-------

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, e.g., DHCP configuration, TCP transfers with edk2 StdLib
applications, and PXE booting in OVMF.


UEFI driver structure
---------------------

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. The state
transitions are labeled with the primary function (and its important callees
faithfully indented) that implement the transition.

                               |  ^
                               |  |
   [DriverBinding.c]           |  | [DriverBinding.c]
   VirtioNetDriverBindingStart |  | VirtioNetDriverBindingStop
     VirtioNetSnpPopulate      |  |   VirtioNetSnpEvacuate
       VirtioNetGetFeatures    |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStopped |
                   +-------------------------+
                               |  ^
                  [SnpStart.c] |  | [SnpStop.c]
                VirtioNetStart |  | VirtioNetStop
                               |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStarted |
                   +-------------------------+
                               |  ^
  [SnpInitialize.c]            |  | [SnpShutdown.c]
  VirtioNetInitialize          |  | VirtioNetShutdown
    VirtioNetInitRing {Rx, Tx} |  |   VirtioNetShutdownRx [SnpSharedHelpers.c]
      VirtioRingInit           |  |   VirtioNetShutdownTx [SnpSharedHelpers.c]
    VirtioNetInitTx            |  |   VirtioRingUninit {Tx, Rx}
    VirtioNetInitRx            |  |
                               v  |
                 +-----------------------------+
                 | EfiSimpleNetworkInitialized |
                 +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance actually doesn't exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.


Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device has been reset. No
resources are allocated for networking / traffic purposes. The MAC address and
other device attributes have been retrieved from the device (this is necessary
for completing the VirtioNetDriverBindingStart transition).

The EfiSimpleNetworkStarted state is identical to the EfiSimpleNetworkStopped
state for virtio-net, in both the functional and the resource-usage sense. This
state is mandated / provided by the Simple Network Protocol for flexibility
that the virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown, which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration (not their actual effect -- a more liberal
  filter setting than requested is allowed by the UEFI specification).
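
The state transitions described above, together with the rule that each
transition function validates its source state, can be illustrated with a
small table-driven sketch in C. This is not the driver's code: the names
SKETCH_STATE, SKETCH_OP and SketchTransition are hypothetical, and the real
driver works with the SNP state field and returns EFI_STATUS codes instead of
a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the four states on the diagram. */
typedef enum {
  StateNonexistent,   /* no driver instance is bound to the device */
  StateStopped,       /* EfiSimpleNetworkStopped                   */
  StateStarted,       /* EfiSimpleNetworkStarted                   */
  StateInitialized    /* EfiSimpleNetworkInitialized               */
} SKETCH_STATE;

/* Hypothetical stand-ins for the six transition functions. */
typedef enum {
  OpBindingStart, OpBindingStop,
  OpStart,        OpStop,
  OpInitialize,   OpShutdown
} SKETCH_OP;

/*
 * Apply Op if it is valid in the current state. On success, update *State
 * and return true; otherwise leave *State untouched and return false.
 */
bool
SketchTransition (SKETCH_STATE *State, SKETCH_OP Op)
{
  static const struct {
    SKETCH_OP    Op;
    SKETCH_STATE From;
    SKETCH_STATE To;
  } Table[] = {
    { OpBindingStart, StateNonexistent, StateStopped     },
    { OpStart,        StateStopped,     StateStarted     },
    { OpInitialize,   StateStarted,     StateInitialized },
    { OpShutdown,     StateInitialized, StateStarted     },
    { OpStop,         StateStarted,     StateStopped     },
    { OpBindingStop,  StateStopped,     StateNonexistent }
  };
  size_t Idx;

  assert (State != NULL);
  for (Idx = 0; Idx < sizeof Table / sizeof Table[0]; Idx++) {
    if (Table[Idx].Op == Op && Table[Idx].From == *State) {
      *State = Table[Idx].To;
      return true;
    }
  }
  return false;
}
```

In this sketch, requesting OpBindingStop while the instance is in
StateInitialized fails and leaves the state unchanged, mirroring the refusal
of VirtioNetDriverBindingStop outside EfiSimpleNetworkStopped.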

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is important to implement critical sections (atomic or
exclusive access to data or a device). This notion is defined as Task Priority
Levels.

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blockage by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The task priority level for the WaitForPacket callback is also set by the
  driver; the choice is TPL_CALLBACK again. This in effect serializes the
  WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal"
  parts of the driver.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, any network driver has
  to stop any in-flight DMA transfers, lest it corrupt OS memory. For this
  reason EXIT_BOOT_SERVICES is emitted and the network driver must abort
  in-flight DMA transfers.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  be scheduled.


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).
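
As a minimal illustration of the 14-byte media header, the following C sketch
lays the three fields out as byte arrays (so that no padding is introduced)
and derives the 1514 byte frame size from it, assuming the standard 1500 byte
Ethernet payload limit. SKETCH_ETHER_HEADER, SKETCH_MAX_PAYLOAD,
SKETCH_MAX_FRAME and SketchEtherType are hypothetical names, not identifiers
from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Media (Ethernet) header as it appears at the start of packet data. */
typedef struct {
  uint8_t DstMac[6];     /* destination MAC address           */
  uint8_t SrcMac[6];     /* source MAC address                */
  uint8_t EtherType[2];  /* Ethertype, big-endian on the wire */
} SKETCH_ETHER_HEADER;

/* Assumption: standard Ethernet payload limit of 1500 bytes. */
#define SKETCH_MAX_PAYLOAD 1500u
#define SKETCH_MAX_FRAME   (sizeof (SKETCH_ETHER_HEADER) + SKETCH_MAX_PAYLOAD)

/* Compile-time check of the "14 bytes total" statement. */
_Static_assert (sizeof (SKETCH_ETHER_HEADER) == 14, "unexpected padding");

/* Extract the Ethertype from the first 14 bytes of a frame. */
uint16_t
SketchEtherType (const uint8_t *Frame)
{
  assert (Frame != NULL);
  return (uint16_t)((Frame[12] << 8) | Frame[13]);
}
```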

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

                     Available Index Available Index
                     last processed    incremented
                       by the host    by the guest
                            v   ------->    v
Available   +-------+-------+-------+-------+-------+
Ring        |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
            +-------+-------+-------+-------+-------+
                             =D6     =D2

             D2         D3          D4         D5          D6         D7
Descr.  +----------+----------++----------+----------++----------+----------+
Table   |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
        +----------+----------++----------+----------++----------+----------+
         =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7


             A2       A3     A4       A5     A6       A7
Receive     +---------------+---------------+---------------+
Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area        +---------------+---------------+---------------+

                  Used Index            Used Index incremented
          last processed by the guest        by the host
                                v ------->  v
Used    +-----------+-----------+-----------+-----------+-----------+
Ring    |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
        +-----------+-----------+-----------+-----------+-----------+
                                 =D4

In VirtioNetInitRx, the guest allocates the fixed size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. A
slice of this area is dedicated to each packet; each slice is further
subdivided into virtio-net request header and network packet data. The
(guest-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.
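
The slicing of the Receive Destination Area can be illustrated with a short C
sketch that maps a subscript to a guest-physical address. It is only a sketch
under stated assumptions: SketchRxAddress and the SKETCH_* constants are
hypothetical names; the 10 byte header size assumes the virtio-net request
header of the 0.9.5 specification without mergeable Rx buffers; the 1514 byte
packet sub-slice size is the one used by the driver; and slices are numbered
from zero here, whereas the simplified diagram starts at A2.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RX_CHAINS      3u     /* three slices, as drawn in the diagram */
#define SKETCH_VNET_HDR_SIZE  10u    /* assumption, see the lead-in           */
#define SKETCH_PACKET_SIZE    1514u  /* packet sub-slice size                 */
#define SKETCH_SLICE_SIZE     (SKETCH_VNET_HDR_SIZE + SKETCH_PACKET_SIZE)

/*
 * Return the address A(Idx) inside the Receive Destination Area that starts
 * at AreaBase. An even Idx selects the virtio-net request header of slice
 * Idx / 2; an odd Idx selects the packet data sub-slice of the same slice.
 */
uint64_t
SketchRxAddress (uint64_t AreaBase, uint16_t Idx)
{
  uint64_t Slice;

  assert (Idx < 2 * SKETCH_RX_CHAINS);
  Slice = AreaBase + (uint64_t)(Idx / 2) * SKETCH_SLICE_SIZE;
  return (Idx % 2 == 0) ? Slice : Slice + SKETCH_VNET_HDR_SIZE;
}
```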

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed (1514 byte) size
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.

Packet reception occurs as follows:

- The host consumes a descriptor index off the Available Ring. This index is
  even (=2*N), and identifies the head descriptor of the chain belonging to
  packet N.

- The host reads the descriptors D(2*N) and -- following the Next link there
  -- D(2*N+1), and stores the virtio-net request header at A(2*N), and the
  packet data at A(2*N+1).

- The host places the index of the head descriptor, 2*N, onto the Used Ring,
  and sets the Len field in the same Used Ring Element to the total number of
  bytes transferred for the entire descriptor chain. This enables the guest to
  identify the length of Rx packets.

- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it
  copies the data out to the caller, and recycles the index of the head
  descriptor (i.e. 2*N) to the Available Ring.

- Because the host can, in theory, process (answer) Rx requests in any order,
  the order of head descriptor indices on each of the Available Ring and the
  Used Ring is virtually random.
  (Except right after the initial population in VirtioNetInitRx, when the
  Available Ring is full and increasing, and the Used Ring is empty.)

- If the Available Ring is empty, the host is forced to drop packets. If the
  Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet
  available).


Virtio internals -- Tx
----------------------

The transmission structure erected by VirtioNetInitTx is similar; it differs
in the following:

- There is no Receive Destination Area.

- Each head descriptor, D(2*N), points to a read-only virtio-net request header
  that is shared by all of the head descriptors. This virtio-net request header
  is never modified by the host.

- Each tail descriptor is re-pointed to the caller-supplied packet buffer
  whenever VirtioNetTransmit places the corresponding head descriptor on the
  Available Ring. The caller is responsible for hanging on to the unmodified
  buffer until it is reported transmitted by VirtioNetGetStatus.

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

- If the stack is empty (that is, each descriptor chain, in isolation, is
  either pending transmission, or has been processed by the host but not
  yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
  EFI_NOT_READY.

- Otherwise the index of a free chain's head descriptor is popped from the
  stack. The linked tail descriptor is re-pointed as discussed above. The head
  descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus.
  If the Used Ring is empty, the function reports no Tx completion. Otherwise,
  a head descriptor's index is consumed from the Used Ring and recycled to the
  private stack. The client code's original packet buffer address is fetched
  from the tail descriptor (where it has been stored at VirtioNetTransmit time)
  and returned to the caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet -- VirtioNetTransmit has already limited
  it to at most 1514 bytes. The Virtio specification suggests this packet size
  is always accepted (and a lower MTU could be encountered on any later hop as
  well). Additionally, there's no good way to report a short transmit via
  VirtioNetGetStatus; judging from the specification, EFI_DEVICE_ERROR seems
  too serious, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
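
The transmission steps above can be modeled with a self-contained C sketch.
It is a simplification made under loud assumptions, not the driver's
implementation: the rings are plain arrays rather than virtio ring structures,
the host is simulated by SketchHostProcess, the tail descriptors are reduced
to a TailBuffer array, and plain integers stand in for EFI_STATUS codes. All
SKETCH_* and Sketch* names, as well as the chain count of four, are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TX_CHAINS  4u     /* hypothetical number of descriptor chains */
#define SKETCH_MAX_FRAME  1514u  /* largest packet accepted for transmission */

typedef struct {
  const void *TailBuffer[SKETCH_TX_CHAINS]; /* stands in for tail descriptors */
  uint16_t   FreeStack[SKETCH_TX_CHAINS];   /* even head indices, free chains */
  uint16_t   FreeCount;
  uint16_t   Avail[SKETCH_TX_CHAINS];       /* head indices, guest to host    */
  uint16_t   AvailCount;
  uint16_t   Used[SKETCH_TX_CHAINS];        /* head indices, host to guest    */
  uint16_t   UsedCount;
} SKETCH_TX;

/* Mark every chain free; head descriptor indices are 0, 2, 4, ... */
void
SketchTxInit (SKETCH_TX *Tx)
{
  uint16_t N;

  Tx->FreeCount = Tx->AvailCount = Tx->UsedCount = 0;
  for (N = 0; N < SKETCH_TX_CHAINS; N++) {
    Tx->TailBuffer[N] = NULL;
    Tx->FreeStack[Tx->FreeCount++] = (uint16_t)(2 * N);
  }
}

/* Returns 0 on success, -1 if no chain is free (EFI_NOT_READY in the
   driver), -2 if the packet is larger than SKETCH_MAX_FRAME. */
int
SketchTransmit (SKETCH_TX *Tx, const void *Buffer, size_t Size)
{
  uint16_t Head;

  if (Size > SKETCH_MAX_FRAME) {
    return -2;
  }
  if (Tx->FreeCount == 0) {
    return -1;
  }
  Head = Tx->FreeStack[--Tx->FreeCount];  /* pop a free chain            */
  Tx->TailBuffer[Head / 2] = Buffer;      /* re-point the tail           */
  Tx->Avail[Tx->AvailCount++] = Head;     /* publish the head index      */
  return 0;
}

/* Simulated host: "transmit" the request in Avail[Slot], in any order. */
void
SketchHostProcess (SKETCH_TX *Tx, uint16_t Slot)
{
  assert (Slot < Tx->AvailCount);
  Tx->Used[Tx->UsedCount++] = Tx->Avail[Slot];
  Tx->Avail[Slot] = Tx->Avail[--Tx->AvailCount];
}

/* Returns the buffer of one completed transmission, or NULL if none. */
const void *
SketchGetStatus (SKETCH_TX *Tx)
{
  uint16_t Head;
  uint16_t Idx;

  if (Tx->UsedCount == 0) {
    return NULL;
  }
  Head = Tx->Used[0];                     /* consume the oldest element  */
  for (Idx = 1; Idx < Tx->UsedCount; Idx++) {
    Tx->Used[Idx - 1] = Tx->Used[Idx];
  }
  Tx->UsedCount--;
  Tx->FreeStack[Tx->FreeCount++] = Head;  /* recycle the chain           */
  return Tx->TailBuffer[Head / 2];
}
```

Because free chains are tracked in a stack and the simulated host may pick
any pending request, the sequence of head indices seen on either array is
not ordered, which matches the last point above.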