1BFQ (Budget Fair Queueing) 2========================== 3 4BFQ is a proportional-share I/O scheduler, with some extra 5low-latency capabilities. In addition to cgroups support (blkio or io 6controllers), BFQ's main features are: 7- BFQ guarantees a high system and application responsiveness, and a 8 low latency for time-sensitive applications, such as audio or video 9 players; 10- BFQ distributes bandwidth, and not just time, among processes or 11 groups (switching back to time distribution when needed to keep 12 throughput high). 13 14In its default configuration, BFQ privileges latency over 15throughput. So, when needed for achieving a lower latency, BFQ builds 16schedules that may lead to a lower throughput. If your main or only 17goal, for a given device, is to achieve the maximum-possible 18throughput at all times, then do switch off all low-latency heuristics 19for that device, by setting low_latency to 0. See Section 3 for 20details on how to configure BFQ for the desired tradeoff between 21latency and throughput, or on how to maximize throughput. 22 23On average CPUs, the current version of BFQ can handle devices 24performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a 25reference, 30-50 KIOPS correspond to very high bandwidths with 26sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and 27to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on 28multi-queue devices too. 29 30The table of contents follow. Impatients can just jump to Section 3. 31 32CONTENTS 33 341. When may BFQ be useful? 35 1-1 Personal systems 36 1-2 Server systems 372. How does BFQ work? 383. What are BFQ's tunables and how to properly configure BFQ? 394. BFQ group scheduling 40 4-1 Service guarantees provided 41 4-2 Interface 42 431. When may BFQ be useful? 44========================== 45 46BFQ provides the following benefits on personal and server systems. 47 481-1 Personal systems 49-------------------- 50 51Low latency for interactive applications 52 53Regardless of the actual background workload, BFQ guarantees that, for 54interactive tasks, the storage device is virtually as responsive as if 55it was idle. For example, even if one or more of the following 56background workloads are being executed: 57- one or more large files are being read, written or copied, 58- a tree of source files is being compiled, 59- one or more virtual machines are performing I/O, 60- a software update is in progress, 61- indexing daemons are scanning filesystems and updating their 62 databases, 63starting an application or loading a file from within an application 64takes about the same time as if the storage device was idle. As a 65comparison, with CFQ, NOOP or DEADLINE, and in the same conditions, 66applications experience high latencies, or even become unresponsive 67until the background workload terminates (also on SSDs). 68 69Low latency for soft real-time applications 70 71Also soft real-time applications, such as audio and video 72players/streamers, enjoy a low latency and a low drop rate, regardless 73of the background I/O workload. As a consequence, these applications 74do not suffer from almost any glitch due to the background workload. 75 76Higher speed for code-development tasks 77 78If some additional workload happens to be executed in parallel, then 79BFQ executes the I/O-related components of typical code-development 80tasks (compilation, checkout, merge, ...) much more quickly than CFQ, 81NOOP or DEADLINE. 82 83High throughput 84 85On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and 86up to 150% higher throughput than DEADLINE and NOOP, with all the 87sequential workloads considered in our tests. With random workloads, 88and with all the workloads on flash-based devices, BFQ achieves, 89instead, about the same throughput as the other schedulers. 90 91Strong fairness, bandwidth and delay guarantees 92 93BFQ distributes the device throughput, and not just the device time, 94among I/O-bound applications in proportion their weights, with any 95workload and regardless of the device parameters. From these bandwidth 96guarantees, it is possible to compute tight per-I/O-request delay 97guarantees by a simple formula. If not configured for strict service 98guarantees, BFQ switches to time-based resource sharing (only) for 99applications that would otherwise cause a throughput loss. 100 1011-2 Server systems 102------------------ 103 104Most benefits for server systems follow from the same service 105properties as above. In particular, regardless of whether additional, 106possibly heavy workloads are being served, BFQ guarantees: 107 108. audio and video-streaming with zero or very low jitter and drop 109 rate; 110 111. fast retrieval of WEB pages and embedded objects; 112 113. real-time recording of data in live-dumping applications (e.g., 114 packet logging); 115 116. responsiveness in local and remote access to a server. 117 118 1192. How does BFQ work? 120===================== 121 122BFQ is a proportional-share I/O scheduler, whose general structure, 123plus a lot of code, are borrowed from CFQ. 124 125- Each process doing I/O on a device is associated with a weight and a 126 (bfq_)queue. 127 128- BFQ grants exclusive access to the device, for a while, to one queue 129 (process) at a time, and implements this service model by 130 associating every queue with a budget, measured in number of 131 sectors. 132 133 - After a queue is granted access to the device, the budget of the 134 queue is decremented, on each request dispatch, by the size of the 135 request. 136 137 - The in-service queue is expired, i.e., its service is suspended, 138 only if one of the following events occurs: 1) the queue finishes 139 its budget, 2) the queue empties, 3) a "budget timeout" fires. 140 141 - The budget timeout prevents processes doing random I/O from 142 holding the device for too long and dramatically reducing 143 throughput. 144 145 - Actually, as in CFQ, a queue associated with a process issuing 146 sync requests may not be expired immediately when it empties. In 147 contrast, BFQ may idle the device for a short time interval, 148 giving the process the chance to go on being served if it issues 149 a new request in time. Device idling typically boosts the 150 throughput on rotational devices and on non-queueing flash-based 151 devices, if processes do synchronous and sequential I/O. In 152 addition, under BFQ, device idling is also instrumental in 153 guaranteeing the desired throughput fraction to processes 154 issuing sync requests (see the description of the slice_idle 155 tunable in this document, or [1, 2], for more details). 156 157 - With respect to idling for service guarantees, if several 158 processes are competing for the device at the same time, but 159 all processes and groups have the same weight, then BFQ 160 guarantees the expected throughput distribution without ever 161 idling the device. Throughput is thus as high as possible in 162 this common scenario. 163 164 - On flash-based storage with internal queueing of commands 165 (typically NCQ), device idling happens to be always detrimental 166 for throughput. So, with these devices, BFQ performs idling 167 only when strictly needed for service guarantees, i.e., for 168 guaranteeing low latency or fairness. In these cases, overall 169 throughput may be sub-optimal. No solution currently exists to 170 provide both strong service guarantees and optimal throughput 171 on devices with internal queueing. 172 173 - If low-latency mode is enabled (default configuration), BFQ 174 executes some special heuristics to detect interactive and soft 175 real-time applications (e.g., video or audio players/streamers), 176 and to reduce their latency. The most important action taken to 177 achieve this goal is to give to the queues associated with these 178 applications more than their fair share of the device 179 throughput. For brevity, we call just "weight-raising" the whole 180 sets of actions taken by BFQ to privilege these queues. In 181 particular, BFQ provides a milder form of weight-raising for 182 interactive applications, and a stronger form for soft real-time 183 applications. 184 185 - BFQ automatically deactivates idling for queues born in a burst of 186 queue creations. In fact, these queues are usually associated with 187 the processes of applications and services that benefit mostly 188 from a high throughput. Examples are systemd during boot, or git 189 grep. 190 191 - As CFQ, BFQ merges queues performing interleaved I/O, i.e., 192 performing random I/O that becomes mostly sequential if 193 merged. Differently from CFQ, BFQ achieves this goal with a more 194 reactive mechanism, called Early Queue Merge (EQM). EQM is so 195 responsive in detecting interleaved I/O (cooperating processes), 196 that it enables BFQ to achieve a high throughput, by queue 197 merging, even for queues for which CFQ needs a different 198 mechanism, preemption, to get a high throughput. As such EQM is a 199 unified mechanism to achieve a high throughput with interleaved 200 I/O. 201 202 - Queues are scheduled according to a variant of WF2Q+, named 203 B-WF2Q+, and implemented using an augmented rb-tree to preserve an 204 O(log N) overall complexity. See [2] for more details. B-WF2Q+ is 205 also ready for hierarchical scheduling, details in Section 4. 206 207 - B-WF2Q+ guarantees a tight deviation with respect to an ideal, 208 perfectly fair, and smooth service. In particular, B-WF2Q+ 209 guarantees that each queue receives a fraction of the device 210 throughput proportional to its weight, even if the throughput 211 fluctuates, and regardless of: the device parameters, the current 212 workload and the budgets assigned to the queue. 213 214 - The last, budget-independence, property (although probably 215 counterintuitive in the first place) is definitely beneficial, for 216 the following reasons: 217 218 - First, with any proportional-share scheduler, the maximum 219 deviation with respect to an ideal service is proportional to 220 the maximum budget (slice) assigned to queues. As a consequence, 221 BFQ can keep this deviation tight not only because of the 222 accurate service of B-WF2Q+, but also because BFQ *does not* 223 need to assign a larger budget to a queue to let the queue 224 receive a higher fraction of the device throughput. 225 226 - Second, BFQ is free to choose, for every process (queue), the 227 budget that best fits the needs of the process, or best 228 leverages the I/O pattern of the process. In particular, BFQ 229 updates queue budgets with a simple feedback-loop algorithm that 230 allows a high throughput to be achieved, while still providing 231 tight latency guarantees to time-sensitive applications. When 232 the in-service queue expires, this algorithm computes the next 233 budget of the queue so as to: 234 235 - Let large budgets be eventually assigned to the queues 236 associated with I/O-bound applications performing sequential 237 I/O: in fact, the longer these applications are served once 238 got access to the device, the higher the throughput is. 239 240 - Let small budgets be eventually assigned to the queues 241 associated with time-sensitive applications (which typically 242 perform sporadic and short I/O), because, the smaller the 243 budget assigned to a queue waiting for service is, the sooner 244 B-WF2Q+ will serve that queue (Subsec 3.3 in [2]). 245 246- If several processes are competing for the device at the same time, 247 but all processes and groups have the same weight, then BFQ 248 guarantees the expected throughput distribution without ever idling 249 the device. It uses preemption instead. Throughput is then much 250 higher in this common scenario. 251 252- ioprio classes are served in strict priority order, i.e., 253 lower-priority queues are not served as long as there are 254 higher-priority queues. Among queues in the same class, the 255 bandwidth is distributed in proportion to the weight of each 256 queue. A very thin extra bandwidth is however guaranteed to 257 the Idle class, to prevent it from starving. 258 259 2603. What are BFQ's tunables and how to properly configure BFQ? 261============================================================= 262 263Most BFQ tunables affect service guarantees (basically latency and 264fairness) and throughput. For full details on how to choose the 265desired tradeoff between service guarantees and throughput, see the 266parameters slice_idle, strict_guarantees and low_latency. For details 267on how to maximise throughput, see slice_idle, timeout_sync and 268max_budget. The other performance-related parameters have been 269inherited from, and have been preserved mostly for compatibility with 270CFQ. So far, no performance improvement has been reported after 271changing the latter parameters in BFQ. 272 273In particular, the tunables back_seek-max, back_seek_penalty, 274fifo_expire_async and fifo_expire_sync below are the same as in 275CFQ. Their description is just copied from that for CFQ. Some 276considerations in the description of slice_idle are copied from CFQ 277too. 278 279per-process ioprio and weight 280----------------------------- 281 282Unless the cgroups interface is used (see "4. BFQ group scheduling"), 283weights can be assigned to processes only indirectly, through I/O 284priorities, and according to the relation: 285weight = (IOPRIO_BE_NR - ioprio) * 10. 286 287Beware that, if low-latency is set, then BFQ automatically raises the 288weight of the queues associated with interactive and soft real-time 289applications. Unset this tunable if you need/want to control weights. 290 291slice_idle 292---------- 293 294This parameter specifies how long BFQ should idle for next I/O 295request, when certain sync BFQ queues become empty. By default 296slice_idle is a non-zero value. Idling has a double purpose: boosting 297throughput and making sure that the desired throughput distribution is 298respected (see the description of how BFQ works, and, if needed, the 299papers referred there). 300 301As for throughput, idling can be very helpful on highly seeky media 302like single spindle SATA/SAS disks where we can cut down on overall 303number of seeks and see improved throughput. 304 305Setting slice_idle to 0 will remove all the idling on queues and one 306should see an overall improved throughput on faster storage devices 307like multiple SATA/SAS disks in hardware RAID configuration, as well 308as flash-based storage with internal command queueing (and 309parallelism). 310 311So depending on storage and workload, it might be useful to set 312slice_idle=0. In general for SATA/SAS disks and software RAID of 313SATA/SAS disks keeping slice_idle enabled should be useful. For any 314configurations where there are multiple spindles behind single LUN 315(Host based hardware RAID controller or for storage arrays), or with 316flash-based fast storage, setting slice_idle=0 might end up in better 317throughput and acceptable latencies. 318 319Idling is however necessary to have service guarantees enforced in 320case of differentiated weights or differentiated I/O-request lengths. 321To see why, suppose that a given BFQ queue A must get several I/O 322requests served for each request served for another queue B. Idling 323ensures that, if A makes a new I/O request slightly after becoming 324empty, then no request of B is dispatched in the middle, and thus A 325does not lose the possibility to get more than one request dispatched 326before the next request of B is dispatched. Note that idling 327guarantees the desired differentiated treatment of queues only in 328terms of I/O-request dispatches. To guarantee that the actual service 329order then corresponds to the dispatch order, the strict_guarantees 330tunable must be set too. 331 332There is an important flipside for idling: apart from the above cases 333where it is beneficial also for throughput, idling can severely impact 334throughput. One important case is random workload. Because of this 335issue, BFQ tends to avoid idling as much as possible, when it is not 336beneficial also for throughput (as detailed in Section 2). As a 337consequence of this behavior, and of further issues described for the 338strict_guarantees tunable, short-term service guarantees may be 339occasionally violated. And, in some cases, these guarantees may be 340more important than guaranteeing maximum throughput. For example, in 341video playing/streaming, a very low drop rate may be more important 342than maximum throughput. In these cases, consider setting the 343strict_guarantees parameter. 344 345strict_guarantees 346----------------- 347 348If this parameter is set (default: unset), then BFQ 349 350- always performs idling when the in-service queue becomes empty; 351 352- forces the device to serve one I/O request at a time, by dispatching a 353 new request only if there is no outstanding request. 354 355In the presence of differentiated weights or I/O-request sizes, both 356the above conditions are needed to guarantee that every BFQ queue 357receives its allotted share of the bandwidth. The first condition is 358needed for the reasons explained in the description of the slice_idle 359tunable. The second condition is needed because all modern storage 360devices reorder internally-queued requests, which may trivially break 361the service guarantees enforced by the I/O scheduler. 362 363Setting strict_guarantees may evidently affect throughput. 364 365back_seek_max 366------------- 367 368This specifies, given in Kbytes, the maximum "distance" for backward seeking. 369The distance is the amount of space from the current head location to the 370sectors that are backward in terms of distance. 371 372This parameter allows the scheduler to anticipate requests in the "backward" 373direction and consider them as being the "next" if they are within this 374distance from the current head location. 375 376back_seek_penalty 377----------------- 378 379This parameter is used to compute the cost of backward seeking. If the 380backward distance of request is just 1/back_seek_penalty from a "front" 381request, then the seeking cost of two requests is considered equivalent. 382 383So scheduler will not bias toward one or the other request (otherwise scheduler 384will bias toward front request). Default value of back_seek_penalty is 2. 385 386fifo_expire_async 387----------------- 388 389This parameter is used to set the timeout of asynchronous requests. Default 390value of this is 248ms. 391 392fifo_expire_sync 393---------------- 394 395This parameter is used to set the timeout of synchronous requests. Default 396value of this is 124ms. In case to favor synchronous requests over asynchronous 397one, this value should be decreased relative to fifo_expire_async. 398 399low_latency 400----------- 401 402This parameter is used to enable/disable BFQ's low latency mode. By 403default, low latency mode is enabled. If enabled, interactive and soft 404real-time applications are privileged and experience a lower latency, 405as explained in more detail in the description of how BFQ works. 406 407DISABLE this mode if you need full control on bandwidth 408distribution. In fact, if it is enabled, then BFQ automatically 409increases the bandwidth share of privileged applications, as the main 410means to guarantee a lower latency to them. 411 412In addition, as already highlighted at the beginning of this document, 413DISABLE this mode if your only goal is to achieve a high throughput. 414In fact, privileging the I/O of some application over the rest may 415entail a lower throughput. To achieve the highest-possible throughput 416on a non-rotational device, setting slice_idle to 0 may be needed too 417(at the cost of giving up any strong guarantee on fairness and low 418latency). 419 420timeout_sync 421------------ 422 423Maximum amount of device time that can be given to a task (queue) once 424it has been selected for service. On devices with costly seeks, 425increasing this time usually increases maximum throughput. On the 426opposite end, increasing this time coarsens the granularity of the 427short-term bandwidth and latency guarantees, especially if the 428following parameter is set to zero. 429 430max_budget 431---------- 432 433Maximum amount of service, measured in sectors, that can be provided 434to a BFQ queue once it is set in service (of course within the limits 435of the above timeout). According to what said in the description of 436the algorithm, larger values increase the throughput in proportion to 437the percentage of sequential I/O requests issued. The price of larger 438values is that they coarsen the granularity of short-term bandwidth 439and latency guarantees. 440 441The default value is 0, which enables auto-tuning: BFQ sets max_budget 442to the maximum number of sectors that can be served during 443timeout_sync, according to the estimated peak rate. 444 445For specific devices, some users have occasionally reported to have 446reached a higher throughput by setting max_budget explicitly, i.e., by 447setting max_budget to a higher value than 0. In particular, they have 448set max_budget to higher values than those to which BFQ would have set 449it with auto-tuning. An alternative way to achieve this goal is to 450just increase the value of timeout_sync, leaving max_budget equal to 0. 451 452weights 453------- 454 455Read-only parameter, used to show the weights of the currently active 456BFQ queues. 457 458 4594. Group scheduling with BFQ 460============================ 461 462BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely 463blkio and io. In particular, BFQ supports weight-based proportional 464share. To activate cgroups support, set BFQ_GROUP_IOSCHED. 465 4664-1 Service guarantees provided 467------------------------------- 468 469With BFQ, proportional share means true proportional share of the 470device bandwidth, according to group weights. For example, a group 471with weight 200 gets twice the bandwidth, and not just twice the time, 472of a group with weight 100. 473 474BFQ supports hierarchies (group trees) of any depth. Bandwidth is 475distributed among groups and processes in the expected way: for each 476group, the children of the group share the whole bandwidth of the 477group in proportion to their weights. In particular, this implies 478that, for each leaf group, every process of the group receives the 479same share of the whole group bandwidth, unless the ioprio of the 480process is modified. 481 482The resource-sharing guarantee for a group may partially or totally 483switch from bandwidth to time, if providing bandwidth guarantees to 484the group lowers the throughput too much. This switch occurs on a 485per-process basis: if a process of a leaf group causes throughput loss 486if served in such a way to receive its share of the bandwidth, then 487BFQ switches back to just time-based proportional share for that 488process. 489 4904-2 Interface 491------------- 492 493To get proportional sharing of bandwidth with BFQ for a given device, 494BFQ must of course be the active scheduler for that device. 495 496Within each group directory, the names of the files associated with 497BFQ-specific cgroup parameters and stats begin with the "bfq." 498prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for 499BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group 500parameter to set the weight of a group with BFQ is blkio.bfq.weight 501or io.bfq.weight. 502 503Parameters to set 504----------------- 505 506For each group, there is only the following parameter to set. 507 508weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the 509group inside its parent. Available values: 1..10000 (default 100). The 510linear mapping between ioprio and weights, described at the beginning 511of the tunable section, is still valid, but all weights higher than 512IOPRIO_BE_NR*10 are mapped to ioprio 0. 513 514Recall that, if low-latency is set, then BFQ automatically raises the 515weight of the queues associated with interactive and soft real-time 516applications. Unset this tunable if you need/want to control weights. 517 518 519[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O 520 Scheduler", Proceedings of the First Workshop on Mobile System 521 Technologies (MST-2015), May 2015. 522 http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf 523 524[2] P. Valente and M. Andreolini, "Improving Application 525 Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of 526 the 5th Annual International Systems and Storage Conference 527 (SYSTOR '12), June 2012. 528 Slightly extended version: 529 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- 530 results.pdf 531