==========================================
Xillybus driver for generic FPGA interface
==========================================

Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com)
Email: eli.billauer@gmail.com or as advertised on Xillybus' site.

Contents:

 - Introduction
  -- Background
  -- Xillybus Overview

 - Usage
  -- User interface
  -- Synchronization
  -- Seekable pipes

 - Internals
  -- Source code organization
  -- Pipe attributes
  -- Host never reads from the FPGA
  -- Channels, pipes, and the message channel
  -- Data streaming
  -- Data granularity
  -- Probing
  -- Buffer allocation
  -- The "nonempty" message (supporting poll)


INTRODUCTION
============

Background
----------

An FPGA (Field Programmable Gate Array) is a piece of logic hardware, which
can be programmed to become virtually anything that is usually found as a
dedicated chipset: For instance, a display adapter, network interface card,
or even a processor with its peripherals. FPGAs are the LEGO of hardware:
Based upon certain building blocks, you make your own toys the way you like
them. It's usually pointless to reimplement something that is already
available on the market as a chipset, so FPGAs are mostly used when some
special functionality is needed, and the production volume is relatively low
(hence not justifying the development of an ASIC).

The challenge with FPGAs is that everything is implemented at a very low
level, even lower than assembly language. In order to allow FPGA designers to
focus on their specific project, and not reinvent the wheel over and over
again, pre-designed building blocks, IP cores, are often used. These are the
FPGA parallels of library functions. IP cores may implement certain
mathematical functions, a functional unit (e.g. a USB interface), an entire
processor (e.g. ARM) or anything that might come in handy. Think of them as
building blocks, with electrical wires dangling on the sides for connection
to other blocks.

One of the daunting tasks in FPGA design is communicating with a full-blown
operating system (actually, with the processor running it): Implementing the
low-level bus protocol and the somewhat higher-level interface with the host
(registers, interrupts, DMA etc.) is a project in itself. When the FPGA's
function is a well-known one (e.g. a video adapter card, or a NIC), it can
make sense to design the FPGA's interface logic specifically for the project.
A special driver is then written to present the FPGA as a well-known
interface to the kernel and/or user space. In that case, there is no reason
to treat the FPGA differently than any other device on the bus.

It's however common that the desired data communication doesn't fit any
well-known peripheral function. Also, the effort of designing an elegant
abstraction for the data exchange is often considered too big. In those
cases, a quicker and possibly less elegant solution is sought: The driver is
effectively written as a user space program, leaving the kernel space part
with just elementary data transport. This still requires designing some
interface logic for the FPGA, and writing a simple ad-hoc driver for the
kernel.

Xillybus Overview
-----------------

Xillybus is an IP core and a Linux driver. Together, they form a kit for
elementary data transport between an FPGA and the host, providing pipe-like
data streams with a straightforward user interface. It's intended as a
low-effort solution for mixed FPGA-host projects, for which it makes sense
to have the project-specific part of the driver running in a user-space
program.

Since the communication requirements may vary significantly from one FPGA
project to another (the number of data pipes needed in each direction and
their attributes), there isn't one specific chunk of logic being the Xillybus
IP core. Rather, the IP core is configured and built based upon a
specification given by its end user.

Xillybus presents independent data streams, which resemble pipes or TCP/IP
communication to the user. At the host side, a character device file is used
just like any pipe file. On the FPGA side, hardware FIFOs are used to stream
the data. This is contrary to a common method of communicating through
fixed-sized buffers (even though such buffers are used by Xillybus under the
hood). There may be more than a hundred of these streams on a single IP
core, but also as few as one, depending on the configuration.

In order to ease the deployment of the Xillybus IP core, it contains a simple
data structure which completely defines the core's configuration. The Linux
driver fetches this data structure during its initialization process, and
sets up the DMA buffers and character devices accordingly. As a result, a
single driver is used to work out of the box with any Xillybus IP core.

The data structure just mentioned should not be confused with PCI's
configuration space or the Flattened Device Tree.

USAGE
=====

User interface
--------------

On the host, all interaction with Xillybus is done through /dev/xillybus_*
device files, which are generated automatically as the driver loads. The
names of these files depend on the IP core that is loaded in the FPGA (see
Probing below). To communicate with the FPGA, open the device file that
corresponds to the hardware FIFO you want to send data to or receive data
from, and use plain write() or read() calls, just like with a regular pipe.
In particular, it makes perfect sense to go:

$ cat mydata > /dev/xillybus_thisfifo

$ cat /dev/xillybus_thatfifo > hisdata

possibly pressing CTRL-C at some stage, even though the xillybus_* pipes
have the capability to send an EOF (but may not use it).

The driver and hardware are designed to behave sensibly as pipes, including:

* Supporting non-blocking I/O (by setting O_NONBLOCK on open()).

* Supporting poll() and select().

* Being bandwidth efficient under load (using DMA) but also handling small
  pieces of data sent across (like TCP/IP) by autoflushing.

A device file can be read only, write only or bidirectional. Bidirectional
device files are treated like two independent pipes (except for sharing a
"channel" structure in the implementation code).

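The same can be done from a C program with ordinary file I/O. The following
is a minimal sketch; the device file names are hypothetical examples, since
the actual names depend on the IP core (see Probing below):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>

  int main(void)
  {
      /* Hypothetical pipe names; the real ones depend on the IP core */
      int wr = open("/dev/xillybus_write_32", O_WRONLY);
      int rd = open("/dev/xillybus_read_32", O_RDONLY);
      char out[] = "hello, FPGA";
      char in[64];
      ssize_t rc;

      if (wr < 0 || rd < 0) {
          perror("open");
          exit(1);
      }

      /* write() may complete with fewer bytes than requested if the pipe
         allows partial completion; a real program loops until done. */
      rc = write(wr, out, sizeof(out));
      if (rc < 0) {
          perror("write");
          exit(1);
      }

      /* Blocks until the FPGA has sent something, unless the file was
         opened with O_NONBLOCK. */
      rc = read(rd, in, sizeof(in));
      if (rc < 0) {
          perror("read");
          exit(1);
      }

      printf("Received %zd bytes\n", rc);

      close(wr);
      close(rd);
      return 0;
  }
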
Synchronization
---------------

Xillybus pipes are configured (on the IP core) to be either synchronous or
asynchronous. For a synchronous pipe, write() returns successfully only after
some data has been submitted and acknowledged by the FPGA. This slows down
bulk data transfers, and is nearly impossible for use with streams that
require data at a constant rate: There is no data transmitted to the FPGA
between write() calls, in particular when the process loses the CPU.

When a pipe is configured asynchronous, write() returns successfully if
there was enough room in the buffers to store any of the data.

For FPGA to host pipes, asynchronous pipes allow data transfer from the FPGA
as soon as the respective device file is opened, regardless of whether the
data has been requested by a read() call. On synchronous pipes, only the
amount of data requested by a read() call is transmitted.

In summary, for synchronous pipes, data between the host and FPGA is
transmitted only to satisfy the read() or write() call currently handled
by the driver, and those calls wait for the transmission to complete before
returning.

Note that the synchronization attribute has nothing to do with the
possibility that read() or write() completes fewer bytes than requested.
There is a separate configuration flag ("allowpartial") that determines
whether such a partial completion is allowed.

Seekable pipes
--------------

A synchronous pipe can be configured to have the stream's position exposed
to the user logic at the FPGA. Such a pipe is also seekable on the host API.
With this feature, a memory or register interface can be attached on the
FPGA side to the seekable stream. Reading or writing to a certain address in
the attached memory is done by seeking to the desired address, and calling
read() or write() as required.

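For example, reading a 32-bit word from some address of a memory attached to
a seekable pipe might look like the sketch below. The device file name and
the address are hypothetical, and a pipe with 32 bit granularity is assumed:

  #include <stdio.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <fcntl.h>

  /* Hypothetical example: read one 32-bit word at byte offset 0x100 of a
     memory that the FPGA side attached to a seekable pipe. */
  int main(void)
  {
      uint32_t word;
      int fd = open("/dev/xillybus_mem_32", O_RDONLY);

      if (fd < 0) {
          perror("open");
          return 1;
      }

      if (lseek(fd, 0x100, SEEK_SET) < 0) {
          perror("lseek");
          return 1;
      }

      if (read(fd, &word, sizeof(word)) != sizeof(word)) {
          perror("read");
          return 1;
      }

      printf("Word at 0x100: 0x%08x\n", (unsigned int) word);
      close(fd);
      return 0;
  }
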
* supports_nonempty: A non-zero value (which is typical) indicates that the
  hardware will send the messages that are necessary to support select() and
  poll() for this pipe.

Host never reads from the FPGA
------------------------------

Even though PCI Express is hot-pluggable in general, a typical motherboard
doesn't expect a card to go away all of a sudden. But since the PCIe card is
based upon reprogrammable logic, a sudden disappearance from the bus is
quite likely as a result of an accidental reprogramming of the FPGA while
the host is up. In practice, nothing happens immediately in such a
situation. But if the host attempts to read from an address that is mapped
to the PCI Express device, that leads to an immediate freeze of the system
on some motherboards, even though the PCIe standard requires a graceful
recovery.

In order to avoid these freezes, the Xillybus driver refrains completely from
reading from the device's register space. All communication from the FPGA to
the host is done through DMA. In particular, the Interrupt Service Routine
doesn't follow the common practice of checking a status register when it's
invoked. Rather, the FPGA prepares a small buffer which contains short
messages, which inform the host what the interrupt was about.

This mechanism is used on non-PCIe buses as well for the sake of uniformity.


Channels, pipes, and the message channel
----------------------------------------

Each of the (possibly bidirectional) pipes presented to the user is allocated
a data channel between the FPGA and the host. The distinction between
channels and pipes is necessary only because of channel 0, which is used for
interrupt-related messages from the FPGA, and has no pipe attached to it.

Data streaming
--------------

Even though a non-segmented data stream is presented to the user at both
sides, the implementation relies on a set of DMA buffers which is allocated
for each channel. For the sake of illustration, let's take the FPGA to host
direction: As data streams into the respective channel's interface in the
FPGA, the Xillybus IP core writes it to one of the DMA buffers. When the
buffer is full, the FPGA informs the host about that (appending a
XILLYMSG_OPCODE_RELEASEBUF message on channel 0 and sending an interrupt if
necessary). The host responds by making the data available for reading
through the character device. When all data has been read, the host writes
to the FPGA's buffer control register, allowing the buffer to be
overwritten. Flow control mechanisms exist on both sides to prevent
underflows and overflows.

This is not good enough for creating a TCP/IP-like stream: If the data flow
stops momentarily before a DMA buffer is filled, the intuitive expectation
is that the partial data in the buffer will arrive anyhow, despite the
buffer not being completed. This is implemented by adding a field to the
XILLYMSG_OPCODE_RELEASEBUF message, through which the FPGA informs not just
which buffer is submitted, but how much data it contains.

But the FPGA will submit a partially filled buffer only if directed to do so
by the host. This situation occurs when the read() method has been blocking
for XILLY_RX_TIMEOUT jiffies (currently 10 ms), after which the host
commands the FPGA to submit a DMA buffer as soon as it can. This timeout
mechanism balances between bus bandwidth efficiency (preventing a lot of
partially filled buffers from being sent) and a latency held fairly low for
tails of data.

A similar setting is used in the host to FPGA direction. The handling of
partial DMA buffers is somewhat different, though. The user can tell the
driver to submit all data it has in the buffers to the FPGA, by issuing a
write() with the byte count set to zero. This is similar to a flush request,
but it doesn't block. There is also an autoflushing mechanism, which
triggers an equivalent flush roughly XILLY_RX_TIMEOUT jiffies after the last
write(). This allows the user to be oblivious about the underlying buffering
mechanism and yet enjoy a stream-like interface.

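For instance, a program that must push out the tail of its data immediately
could do something like the hypothetical sketch below (the device file name
is made up):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <fcntl.h>

  int main(void)
  {
      const char msg[] = "tail of a message";
      int fd = open("/dev/xillybus_write_32", O_WRONLY);

      if (fd < 0) {
          perror("open");
          return 1;
      }

      if (write(fd, msg, strlen(msg)) < 0) {
          perror("write");
          return 1;
      }

      /* A write() with a byte count of zero asks the driver to submit
         whatever is pending in the DMA buffers to the FPGA right away;
         it returns without blocking. */
      if (write(fd, msg, 0) < 0)
          perror("flush request");

      close(fd);
      return 0;
  }
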
Note that the issue of partial buffer flushing is irrelevant for pipes having
the "synchronous" attribute nonzero, since synchronous pipes don't allow data
to lie around in the DMA buffers between read() and write() anyhow.

Data granularity
----------------

The data arrives or is sent at the FPGA as 8, 16 or 32 bit wide words, as
configured by the "format" attribute. Whenever possible, the driver attempts
to hide this when the pipe is accessed differently from its natural
alignment. For example, reading single bytes from a pipe with 32 bit
granularity works with no issues. Writing single bytes to pipes with 16 or
32 bit granularity will also work, but the driver can't send partially
completed words to the FPGA, so the transmission of up to one word may be
held until it's fully occupied with user data.

This somewhat complicates the handling of host to FPGA streams, because
when a buffer is flushed, it may contain up to 3 bytes that don't form a
word in the FPGA, and hence can't be sent. To prevent loss of data, these
leftover bytes need to be moved to the next buffer. The parts in
xillybus_core.c that mention "leftovers" in some way are related to this
complication.

Probing
-------

As mentioned earlier, the number of pipes that are created when the driver
loads and their attributes depend on the Xillybus IP core in the FPGA. During
the driver's initialization, a blob containing configuration info, the
Interface Description Table (IDT), is sent from the FPGA to the host. The
bootstrap process is done in three phases:

1. Acquire the length of the IDT, so a buffer can be allocated for it. This
   is done by sending a quiesce command to the device, since the acknowledge
   for this command contains the IDT's buffer length.

2. Acquire the IDT itself.

3. Create the interfaces according to the IDT.

Buffer allocation
-----------------

In order to simplify the logic that prevents illegal boundary crossings of
PCIe packets, the following rule applies: If a buffer is smaller than 4kB,
it must not cross a 4kB boundary. Otherwise, it must be 4kB aligned. The
xilly_setupchannels() function allocates these buffers by requesting whole
pages from the kernel, and dividing them into DMA buffers as necessary.
Since all buffers' sizes are powers of two, it's possible to pack any set of
such buffers, with a maximal waste of one page of memory.

All buffers are allocated when the driver is loaded. This is necessary,
since large contiguous physical memory segments are sometimes requested,
which are more likely to be available when the system is freshly booted.

The allocation of buffer memory takes place in the same order as the pipes
appear in the IDT. The driver relies on a rule that the pipes are sorted
with decreasing buffer size in the IDT. If a requested buffer is larger than
or equal to a page, the necessary number of pages is requested from the
kernel, and these are used for this buffer. If the requested buffer is
smaller than a page, one single page is requested from the kernel, and that
page is partially used. Or, if there already is a partially used page at
hand, the buffer is packed into that page. It can be shown that all pages
requested from the kernel (except possibly for the last) are 100% utilized
this way.

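To illustrate why this works, here is a small user space simulation of the
packing rule just described. This is an illustration only, not the driver's
actual code; it assumes power-of-two buffer sizes listed in decreasing
order, and the buffer sizes shown are made-up examples:

  #include <stdio.h>

  #define PAGE_SIZE 4096

  int main(void)
  {
      /* Hypothetical buffer sizes in bytes, sorted in decreasing order */
      int bufsize[] = { 8192, 4096, 2048, 1024, 1024, 512, 128, 128 };
      int n = sizeof(bufsize) / sizeof(bufsize[0]);
      int pages = 0, left = 0;  /* bytes left in the current page */

      for (int i = 0; i < n; i++) {
          if (bufsize[i] >= PAGE_SIZE) {
              /* Whole pages, 4kB aligned by construction */
              pages += bufsize[i] / PAGE_SIZE;
              left = 0;
          } else if (bufsize[i] <= left) {
              /* Packs into the partially used page; since the sizes
                 are decreasing powers of two, no 4kB boundary is
                 crossed */
              left -= bufsize[i];
          } else {
              /* The previous page is exactly full; start a new one */
              pages++;
              left = PAGE_SIZE - bufsize[i];
          }
      }

      printf("%d pages needed, %d bytes unused\n", pages, left);
      return 0;
  }
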
The "nonempty" message (supporting poll)
----------------------------------------

In order to support the "poll" method (and hence select()), there is a small
catch regarding the FPGA to host direction: The FPGA may have filled a DMA
buffer with some data, but not submitted that buffer. If the host waited for
the buffer's submission by the FPGA, there would be a possibility that the
FPGA side has sent data, but a select() call would still block, because the
host has not received any notification about this. This is solved with
XILLYMSG_OPCODE_NONEMPTY messages sent by the FPGA when a channel goes from
completely empty to containing some data.

These messages are used only to support poll() and select(). The IP core can
be configured not to send them for a slight reduction of bandwidth.

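From user space, this mechanism is transparent: poll() and select() simply
work on the device files. As a usage note, waiting for data on an FPGA to
host pipe could look like the hypothetical sketch below (the device file
name is made up):

  #include <stdio.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <poll.h>

  /* Wait up to one second for data to become available on an FPGA to
     host pipe, without committing to a read(). */
  int main(void)
  {
      struct pollfd pfd;
      char buf[512];
      int rc;

      pfd.fd = open("/dev/xillybus_read_32", O_RDONLY | O_NONBLOCK);
      pfd.events = POLLIN;

      if (pfd.fd < 0) {
          perror("open");
          return 1;
      }

      rc = poll(&pfd, 1, 1000);

      if (rc > 0 && (pfd.revents & POLLIN)) {
          ssize_t n = read(pfd.fd, buf, sizeof(buf));
          printf("Read %zd bytes\n", n);
      } else {
          printf("No data within one second\n");
      }

      close(pfd.fd);
      return 0;
  }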