I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer; see the random notes section below.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finish before issuing the barrier request.  Also,
it defers requests following the barrier until the barrier request is
finished.  Older SCSI controllers/drives and SATA drives fall in this
category.

iii. Devices which have queue depth of 1.  This is a degenerate case
of ii.  Just keeping issue order suffices.
Ancient SCSI controllers/drives and IDE drives are in this category.

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other event abruptly stops the drive from operating and possibly makes
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases.

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  Devices of this kind can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside block layer proper.  All a low
level driver has to do is implement its prepare_flush_fn and use the
following function to indicate what barrier type it supports and how
to prepare flush requests.  Note that the term 'ordered' is used to
indicate the whole sequence of performing barrier requests including
draining and flushing.

typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);

int blk_queue_ordered(struct request_queue *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn);

@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
			  flushes cache to physical medium when executed

For example, SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
	/* start from an all-zero command block */
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	/* SYNCHRONIZE CACHE asks the drive to flush its cache to medium */
	rq->cmd[0] = SYNCHRONIZE_CACHE;
	rq->cmd_len = 10;
}

The following seven ordered modes are supported.  The following table
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

	    write-back cache	ordered tag	flush		FUA
-----------------------------------------------------------------------
NONE		yes/no		N/A		no		N/A
DRAIN		no		no		N/A		N/A
DRAIN_FLUSH	yes		no		yes		no
DRAIN_FUA	yes		no		yes		yes
TAG		no		yes		N/A		N/A
TAG_FLUSH	yes		yes		yes		no
TAG_FUA		yes		yes		yes		yes


QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushes are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed.  By using FUA on barrier
	request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushes are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed.  By using FUA on barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier


Random notes/caveats
--------------------

* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that the SCSI
midlayer request dispatch function is not atomic.  It releases the
queue lock and switches to the SCSI host lock during issue, so it is
possible, and over time likely, that requests change their relative
positions.  Once this problem is solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by
block layer until the previous I/O barrier is complete.  This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to the low level driver *might* be helpful if they are very frequent.
Well, this certainly is a non-issue.
I'm writing this just to make clear that no two I/O barriers are ever
passed to a low-level driver at the same time.

* Completion order.  Requests in an ordered sequence are issued in
order but are not required to finish in order.  Barrier implementation
can handle out-of-order completion of an ordered sequence.  IOW, the
requests MUST be processed in order but the hardware/software
completion paths are allowed to reorder completion notifications -
e.g. current SCSI midlayer doesn't preserve completion order during
error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request().  As barrier sequence should be kept in order
when requeued, generic elevator code takes care of putting requests in
order around barrier.  See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence.  Currently, no
error checking is done against this.

* Error handling.  Currently, block layer will report an error to the
upper layer if any of the requests in an ordered sequence fails.
Unfortunately, this doesn't seem to be enough.  Look at the following
request flow.  QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
					  still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid.  Consider the following sequence.

 i.	Requests [0] ~ [post] leave the request queue and enter the
	low-level driver.
 ii.	After a while, unfortunately, something goes wrong and the
	drive fails [2].
	Note that any of [0], [1] and [3] could have
	completed by this time, but [pre] couldn't have been finished
	as the drive must process it in order and it failed before
	processing that command.
 iii.	Error handling kicks in and determines that the error is
	unrecoverable and fails [2], and resumes operation.
 iv.	[pre] [barrier] [post] get processed.
 v.	*BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on success of some of the preceding requests,
where only the upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after
block layer tells them to do so.

As the probability of this happening is very low and the drive would
have to be faulty, implementing the fix is probably overkill.  But,
still, it's there.

* In previous drafts of barrier implementation, there was a fallback
mechanism such that, if FUA or ordered TAG fails, a less fancy ordered
mode can be selected and the failed barrier request is retried
automatically.  The rationale for this feature was that as FUA is
pretty new in ATA world and ordered tag was never used widely, there
could be devices which claim to support those features but choke when
actually given such requests.

 This was removed for two reasons: 1. it's overkill, and 2. it's
impossible to implement properly when TAG ordering is used, as low
level drivers resume after an error automatically.
If it's ever needed, adding it back and modifying low level drivers
accordingly shouldn't be difficult.
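
Appendix: ordered mode selection sketch

As a rough illustration of the ordered mode table in the driver
section, the following userspace C sketch picks a mode from four
capability flags and returns the sequence string given in the mode
descriptions.  It is not kernel code.  enum ordered_mode, struct
dev_caps, pick_ordered_mode() and ordered_sequence() are hypothetical
names made up for this example, and mapping "write-back cache without
flush" to NONE is an assumption based on case ii of the forced
flushing section.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for QUEUE_ORDERED_*; values are illustrative. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Hypothetical capability flags, one per column of the table. */
struct dev_caps {
	bool write_back_cache;
	bool ordered_tag;
	bool flush;
	bool fua;
};

/* Walk the table: cache first, then flush, then tag, then FUA. */
static enum ordered_mode pick_ordered_mode(const struct dev_caps *c)
{
	/* No write-back cache: ordering alone is enough. */
	if (!c->write_back_cache)
		return c->ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;

	/* Assumption: cache without flush cannot do barriers at all. */
	if (!c->flush)
		return ORDERED_NONE;

	if (c->ordered_tag)
		return c->fua ? ORDERED_TAG_FUA : ORDERED_TAG_FLUSH;

	return c->fua ? ORDERED_DRAIN_FUA : ORDERED_DRAIN_FLUSH;
}

/*
 * Sequence strings copied from the mode descriptions.  '=>' means the
 * previous step must complete first, '->' means it only has to be
 * issued first.
 */
static const char *ordered_sequence(enum ordered_mode mode)
{
	switch (mode) {
	case ORDERED_DRAIN:
		return "drain => barrier";
	case ORDERED_DRAIN_FLUSH:
		return "drain => preflush => barrier => postflush";
	case ORDERED_DRAIN_FUA:
		return "drain => preflush => barrier";
	case ORDERED_TAG:
		return "barrier";
	case ORDERED_TAG_FLUSH:
		return "preflush -> barrier -> postflush";
	case ORDERED_TAG_FUA:
		return "preflush -> barrier";
	case ORDERED_NONE:
		break;
	}
	return "N/A";
}
```

A real driver would pass the chosen QUEUE_ORDERED_* value to
blk_queue_ordered() together with its prepare_flush_fn; the sketch
only mirrors the decision order implied by the table.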