Contents:

1) TCM Userspace Design
  a) Background
  b) Benefits
  c) Design constraints
  d) Implementation overview
     i. Mailbox
     ii. Command ring
     iii. Data Area
  e) Device discovery
  f) Device events
  g) Other contingencies
2) Writing a user pass-through handler
  a) Discovering and configuring TCMU uio devices
  b) Waiting for events on the device(s)
  c) Managing the command ring
3) A final note


TCM Userspace Design
--------------------

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules for file, block device, RAM or using another SCSI device as
storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

Background:

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage as well. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits:

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is there are more distinct components to configure, and
potentially to malfunction. This is unavoidable, but hopefully not
fatal if we're careful to keep things as simple as possible.

Design constraints:

- Good performance: high throughput, low latency
- Cleanly handle if userspace:
   1) never attaches
   2) hangs
   3) dies
   4) misbehaves
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview:

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region is: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands.
Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.

The Mailbox:

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)
flags:
- TCMU_MAILBOX_FLAG_CAP_OOOC: indicates out-of-order completion is
  supported. See "The Command Ring" for details.
cmdr_off - The offset of the start of the command ring from the start
of the memory region, to account for the mailbox size.
cmdr_size - The size of the command ring. This does *not* need to be a
power of two.
cmd_head - Modified by the kernel to indicate when a command has been
placed on the ring.
cmd_tail - Modified by userspace to indicate when it has completed
processing of a command.

The Command Ring:

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

TCMU commands are 8-byte aligned.
They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.

When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the entry's length, as stored in hdr.len_op (mod
cmdr_size), and signals the kernel via the UIO method, a 4-byte write
to the file descriptor.

If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions. In this case, userspace
can complete commands in an order different from the one in which they
were placed on the ring. Since the kernel still processes entries in
the order they appear in the command ring, userspace needs to set the
cmd_id of the entry at cmd_tail to that of the command it is actually
completing (in effect stealing the original command's entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future.
If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area:

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery:

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format:

tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at:

/sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

For all devices so discovered, the user handler opens /dev/uioX and
calls mmap():

mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events:

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config".
This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies:

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds).

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
-------------------------------------------------------

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.
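
The subtype check can be sketched as a small parser for the UIO name
format given under "Device Discovery". This is only an illustration:
the struct and both function names are hypothetical helpers, not part
of the TCMU interface, and the 63-character limits on the device name
and subtype are assumptions made to keep the example short. <path> is
treated as everything after the subtype and may itself contain '/'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type; not defined by target_core_user.h. */
struct tcmu_uio_name {
	unsigned int hba_num;
	char device_name[64];	/* assumed limit for this sketch */
	char subtype[64];	/* assumed limit for this sketch */
	const char *path;	/* points into the caller's string, may be "" */
};

/*
 * Parse "tcm-user/<hba_num>/<device_name>/<subtype>/<path>".
 * Returns 0 on success, -1 if the name does not have that format.
 */
static int parse_tcmu_uio_name(const char *name, struct tcmu_uio_name *out)
{
	int consumed = 0;

	assert(name != NULL && out != NULL);

	if (sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]%n",
		   &out->hba_num, out->device_name, out->subtype,
		   &consumed) != 3)
		return -1;

	if (name[consumed] == '/')
		out->path = name + consumed + 1;  /* rest of the string */
	else if (name[consumed] == '\0')
		out->path = "";                   /* no path component */
	else
		return -1;                        /* subtype was truncated */

	return 0;
}

/* Returns 1 if the UIO name is a TCMU device with the given subtype. */
static int tcmu_name_has_subtype(const char *name, const char *subtype)
{
	struct tcmu_uio_name parsed;

	if (parse_tcmu_uio_name(name, &parsed) != 0)
		return 0;
	return strcmp(parsed.subtype, subtype) == 0;
}
```

A handler would pass the string read from /sys/class/uio/uioX/name
(with the trailing newline removed) and its own subtype string, and
skip every device for which tcmu_name_has_subtype() returns 0.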

a) Discovering and configuring TCMU UIO devices:

(error checking omitted for brevity)

int fd, dev_fd;
int ret;
char buf[256];
unsigned long long map_len;
void *map;

fd = open("/sys/class/uio/uio0/name", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

/* we only want uio devices whose name is a format we expect */
if (strncmp(buf, "tcm-user", 8))
  exit(-1);

/* Further checking for subtype also needed here */

fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

map_len = strtoull(buf, NULL, 0);

dev_fd = open("/dev/uio0", O_RDWR);
map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)

while (1) {
  char buf[4];

  int ret = read(dev_fd, buf, 4); /* will block */

  handle_device_events(dev_fd, map);
}


c) Managing the command ring

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/target_core_user.h>

int handle_device_events(int fd, void *map)
{
  struct tcmu_mailbox *mb = map;
  struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
  int did_some_work = 0;

  /* Process events from cmd ring until we catch up with cmd_head */
  while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

    if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
      uint8_t *cdb = (void *)mb + ent->req.cdb_off;
      bool success = true;

      /*
       * Handle command here.
       */
      printf("SCSI opcode: 0x%x\n", cdb[0]);

      /*
       * Set response fields. The status constants used below must
       * carry the values from the SCSI specifications (0 for good
       * status, 2 for CHECK CONDITION).
       */
      if (success)
        ent->rsp.scsi_status = SCSI_NO_SENSE;
      else {
        /* Also fill in rsp->sense_buffer here */
        ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
      }
    }
    else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
      /* Tell the kernel we didn't handle unknown opcodes */
      ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
    }
    else {
      /* Do nothing for PAD entries except update cmd_tail */
    }

    /* update cmd_tail; the entry length is encoded in hdr.len_op */
    mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
    ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
    did_some_work = 1;
  }

  /* Notify the kernel that work has been finished */
  if (did_some_work) {
    uint32_t buf = 0;

    write(fd, &buf, 4);
  }

  return 0;
}


A final note
------------

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
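
To make this note concrete, here is a minimal sketch of the status
byte values and of fixed-format sense data as laid out by the SCSI
specifications. The EX_ names and the helper function are invented for
this example and are not part of any kernel or TCMU header. The helper
takes a plain buffer and length so that it stands alone; a handler
would pass rsp.sense_buffer and its size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Status byte values from the SCSI specifications (names are local). */
enum ex_scsi_status {
	EX_STATUS_GOOD            = 0x00,
	EX_STATUS_CHECK_CONDITION = 0x02,
	EX_STATUS_BUSY            = 0x08,
};

/* Fixed-format sense data occupies at least 18 bytes. */
#define EX_FIXED_SENSE_LEN 18

/*
 * Fill 'sense' with fixed-format sense data and return the status
 * byte that must accompany it (CHECK CONDITION, i.e. 2).
 */
static uint8_t ex_set_check_condition(uint8_t *sense, size_t len,
				      uint8_t key, uint8_t asc, uint8_t ascq)
{
	assert(sense != NULL && len >= EX_FIXED_SENSE_LEN);

	memset(sense, 0, len);
	sense[0] = 0x70;                    /* fixed format, current error */
	sense[2] = key & 0x0f;              /* sense key */
	sense[7] = EX_FIXED_SENSE_LEN - 8;  /* additional sense length */
	sense[12] = asc;                    /* additional sense code */
	sense[13] = ascq;                   /* additional sense code qualifier */

	return EX_STATUS_CHECK_CONDITION;
}
```

For example, a handler rejecting an unsupported CDB could call
ex_set_check_condition(buf, len, 0x05, 0x20, 0x00), which encodes
ILLEGAL REQUEST with INVALID COMMAND OPERATION CODE, and store the
return value in rsp.scsi_status.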