dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10 TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data.
The conventional zones of the device may also be used for buffering user
random writes. Data in these zones may be directly mapped to the
conventional zone, but is later moved to a sequential zone so that the
conventional zone can be reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the amount and position of metadata blocks
on disk.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is valid in only one place at a time: either in the
data zone mapping the chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer).
Otherwise, write operations are processed indirectly using a buffer
zone. In that case, an unused conventional zone is allocated and
assigned to the chunk being accessed. Writing a block to the buffer zone
of a chunk will automatically invalidate the same block in the
sequential zone mapping the chunk. If all blocks of the sequential zone
become invalid, the zone is freed and the chunk buffer zone becomes the
primary zone mapping the chunk, resulting in native random write
performance similar to a regular block device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation is terminated.

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata.
Once this operation completes, in-place metadata block updates can be
done in the primary metadata set. This ensures that one of the sets is
always consistent (all modifications committed or none at all). Flush
operations are used as a commit point. Upon reception of a flush
request, metadata modification activity is temporarily blocked (for both
incoming BIO processing and the reclaim process) and all dirty metadata
blocks are staged and updated. Normal operation is then resumed.
Flushing metadata thus only temporarily delays write and discard
requests. Read requests can be processed concurrently while a metadata
flush is being executed.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:

echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
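
The target creation step above can be wrapped in a small shell helper.
This is a minimal sketch only and is not part of dm-zoned or
dm-zoned-tools: the function names dmz_table_line, dmz_target_name and
dmz_create are hypothetical, and the check of the "queue/zoned" sysfs
attribute is an assumption about the running kernel (the check is
skipped if the attribute is not readable). The sketch builds the same
table line and target name as the example above.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the device-mapper table line for a dm-zoned target.
# $1: backing zoned block device, $2: device size in 512-byte sectors.
dmz_table_line() {
    printf '0 %s zoned %s\n' "$2" "$1"
}

# Print the target name used in the example above: dmz-<device basename>.
dmz_target_name() {
    printf 'dmz-%s\n' "$(basename "$1")"
}

# Create the target for a device already formatted with dmzadm.
dmz_create() {
    dev="$1"
    model_file="/sys/block/$(basename "$dev")/queue/zoned"
    # Assumption: the kernel reports the zone model in this attribute,
    # with "none" meaning a regular (non-zoned) block device.
    if [ -r "$model_file" ] && [ "$(cat "$model_file")" = "none" ]; then
        echo "$dev is not a zoned block device" >&2
        return 1
    fi
    size="$(blockdev --getsize "$dev")" || return 1
    dmz_table_line "$dev" "$size" | dmsetup create "$(dmz_target_name "$dev")"
}
```

With these definitions loaded, "dmz_create /dev/sdxx" would pass the
table line for /dev/sdxx to dmsetup and create a target named dmz-sdxx.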