                                Block IO Controller
                                ===================
Overview
========
cgroup subsys "blkio" implements the block IO controller. There is a need
for various kinds of IO control policies (like proportional BW, max BW)
both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
The plan is to use the same cgroup based management interface for the blkio
controller and, based on user options, switch IO policies in the background.

Currently two IO control policies are implemented. The first one is a
proportional weight, time based division of disk policy. It is implemented
in CFQ. Hence this policy takes effect only on leaf nodes when CFQ is being
used. The second one is a throttling policy which can be used to specify
upper IO rate limits on devices. This policy is implemented in the generic
block layer and can be used on leaf nodes as well as on higher level logical
devices like device mapper.

HOWTO
=====
Proportional Weight division of bandwidth
-----------------------------------------
You can do a very simple test by running two dd threads in two different
cgroups. Here is what you can do.

- Enable Block IO controller
        CONFIG_BLK_CGROUP=y

- Enable group scheduling in CFQ
        CONFIG_CFQ_GROUP_IOSCHED=y

- Compile and boot into the kernel and mount the IO controller (blkio); see
  cgroups.txt, Why are cgroups needed?.

        mount -t tmpfs cgroup_root /sys/fs/cgroup
        mkdir /sys/fs/cgroup/blkio
        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Create two cgroups
        mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2

- Set weights of group test1 and test2
        echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
        echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight

- Create two same size files (say 512MB each) on the same disk (file1, file2)
  and launch two dd threads in different cgroups to read those files.

        sync
        echo 3 > /proc/sys/vm/drop_caches

        dd if=/mnt/sdb/zerofile1 of=/dev/null &
        echo $! > /sys/fs/cgroup/blkio/test1/tasks
        cat /sys/fs/cgroup/blkio/test1/tasks

        dd if=/mnt/sdb/zerofile2 of=/dev/null &
        echo $! > /sys/fs/cgroup/blkio/test2/tasks
        cat /sys/fs/cgroup/blkio/test2/tasks

- At macro level, the first dd should finish first. To get more precise data,
  keep looking (with the help of a script, such as the sketch below) at the
  blkio.time and blkio.sectors files of both the test1 and test2 groups.
  These tell how much disk time (in milliseconds) each group got and how many
  sectors each group dispatched to the disk. We provide fairness in terms of
  disk time, so ideally blkio.time of the cgroups should be in proportion to
  the weight.
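
  The following is one possible polling loop. It is only a sketch: the cgroup
  paths assume the /sys/fs/cgroup/blkio mount point used above, and the one
  second interval is arbitrary.

        # Print disk time and sector counts for both groups once per second.
        while sleep 1; do
                for g in test1 test2; do
                        echo "== $g =="
                        cat /sys/fs/cgroup/blkio/$g/blkio.time
                        cat /sys/fs/cgroup/blkio/$g/blkio.sectors
                done
        done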

Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller
        CONFIG_BLK_CGROUP=y

- Enable throttling in block layer
        CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on a particular device for the root group. The
  format for a policy is "<major>:<minor> <bytes_per_second>".

        echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  The above will put a limit of 1MB/second on reads happening for the root
  group on the device having major/minor number 8:16.

- Run dd to read a file and see if the rate is throttled to 1MB/s or not.

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

  Limits for writes can be put using the blkio.throttle.write_bps_device file.
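
  If you are not sure about the major/minor numbers of a device, they can be
  read from the device node. The following is an illustrative sketch (the
  device name, numbers and rate are just examples); write limits use the same
  "<major>:<minor> <bytes_per_second>" format.

        # ls -l /dev/sdb
        brw-rw---- 1 root disk 8, 16 Mar 10 10:00 /dev/sdb

        # echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device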

Hierarchical Cgroups
====================
- Currently only CFQ supports hierarchical groups. For throttling, the
  cgroup interface does allow creation of hierarchical cgroups, but
  internally it treats them as a flat hierarchy.

  If somebody created a hierarchy like the following:

                        root
                       /    \
                   test1    test2
                     |
                   test3

  CFQ will handle the hierarchy correctly, but throttling will practically
  treat all the groups as being at the same level. For details on CFQ
  hierarchy support, refer to Documentation/block/cfq-iosched.txt.
  Throttling will treat the hierarchy as if it looked like the following.

                         pivot
                      /  /  \  \
                  root test1 test2 test3

  Nesting cgroups, while allowed, isn't officially supported and blkio
  generates a warning when cgroups are nested. Once throttling implements
  hierarchy support, hierarchy will be supported and the warning will
  be removed.

Various user visible config options
===================================
CONFIG_BLK_CGROUP
        - Block IO controller.

CONFIG_DEBUG_BLK_CGROUP
        - Debug help. Right now some additional stats files show up in the
          cgroup if this option is enabled.

CONFIG_CFQ_GROUP_IOSCHED
        - Enables group scheduling in CFQ.

CONFIG_BLK_DEV_THROTTLING
        - Enable block device throttling support in block layer.

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
        - Specifies per cgroup weight. This is the default weight of the
          group on all devices until and unless overridden by a per device
          rule (see blkio.weight_device).
          The currently allowed range of weights is from 10 to 1000.

- blkio.weight_device
        - One can specify per cgroup per device rules using this interface.
          These rules override the default value of group weight as specified
          by blkio.weight.

          Following is the format.

          # echo dev_maj:dev_minor weight > blkio.weight_device
          Configure weight=300 on /dev/sdb (8:16) in this cgroup
          # echo 8:16 300 > blkio.weight_device
          # cat blkio.weight_device
          dev     weight
          8:16    300

          Configure weight=500 on /dev/sda (8:0) in this cgroup
          # echo 8:0 500 > blkio.weight_device
          # cat blkio.weight_device
          dev     weight
          8:0     500
          8:16    300

          Remove specific weight for /dev/sda in this cgroup
          # echo 8:0 0 > blkio.weight_device
          # cat blkio.weight_device
          dev     weight
          8:16    300

- blkio.leaf_weight[_device]
        - Equivalents of blkio.weight[_device] for the purpose of
          deciding how much weight tasks in the given cgroup have while
          competing with the cgroup's child cgroups. For details,
          please refer to Documentation/block/cfq-iosched.txt.

- blkio.time
        - Disk time allocated to the cgroup per device in milliseconds. The
          first two fields specify the major and minor number of the device
          and the third field specifies the disk time allocated to the group
          in milliseconds.

- blkio.sectors
        - Number of sectors transferred to/from disk by the group. The first
          two fields specify the major and minor number of the device and
          the third field specifies the number of sectors transferred by the
          group to/from the device.

- blkio.io_service_bytes
        - Number of bytes transferred to/from the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. The first two fields specify the major and minor number
          of the device, the third field specifies the operation type and
          the fourth field specifies the number of bytes. Sample output is
          sketched below.
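
          The following is a sketch of what the output can look like for a
          single device 8:16 (the device and the numbers are illustrative).
          The other per-operation stat files below use the same layout.

          # cat blkio.io_service_bytes
          8:16 Read 1310720
          8:16 Write 0
          8:16 Sync 1310720
          8:16 Async 0
          8:16 Total 1310720
          Total 1310720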

- blkio.io_serviced
        - Number of IOs completed to/from the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. The first two fields specify the major and minor number
          of the device, the third field specifies the operation type and
          the fourth field specifies the number of IOs.

- blkio.io_service_time
        - Total amount of time between request dispatch and request
          completion for the IOs done by this cgroup. This is in nanoseconds
          to make it meaningful for flash devices too. For devices with a
          queue depth of 1, this time represents the actual service time.
          When queue_depth > 1, that is no longer true as requests may be
          served out of order. This may cause the service time for a given IO
          to include the service time of multiple IOs when served out of
          order, which may result in total io_service_time > actual time
          elapsed. This time is further divided by the type of operation -
          read or write, sync or async. The first two fields specify the
          major and minor number of the device, the third field specifies
          the operation type and the fourth field specifies the
          io_service_time in ns.

- blkio.io_wait_time
        - Total amount of time the IOs for this cgroup spent waiting in the
          scheduler queues for service. This can be greater than the total
          time elapsed since it is the cumulative io_wait_time for all IOs.
          It is not a measure of total time the cgroup spent waiting but
          rather a measure of the wait_time of its individual IOs. For
          devices with queue_depth > 1 this metric does not include the time
          spent waiting for service once the IO is dispatched to the device
          but till it actually gets serviced (there might be a time lag here
          due to re-ordering of requests by the device). This is in
          nanoseconds to make it meaningful for flash devices too. This time
          is further divided by the type of operation - read or write, sync
          or async. The first two fields specify the major and minor number
          of the device, the third field specifies the operation type and
          the fourth field specifies the io_wait_time in ns.

- blkio.io_merged
        - Total number of bios/requests merged into requests belonging to
          this cgroup. This is further divided by the type of operation -
          read or write, sync or async.

- blkio.io_queued
        - Total number of requests queued up at any given instant for this
          cgroup. This is further divided by the type of operation - read or
          write, sync or async.

- blkio.avg_queue_size
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          The average queue size for this cgroup over the entire time of this
          cgroup's existence. Queue size samples are taken each time one of
          the queues of this cgroup gets a timeslice.

- blkio.group_wait_time
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          This is the amount of time the cgroup had to wait since it became
          busy (i.e., went from 0 to 1 request queued) to get a timeslice for
          one of its queues. This is different from io_wait_time, which is
          the cumulative total of the amount of time spent by each IO in that
          cgroup waiting in the scheduler queue. This is in nanoseconds. If
          this is read when the cgroup is in a waiting (for timeslice) state,
          the stat will only report the group_wait_time accumulated till the
          last time it got a timeslice and will not include the current
          delta.

- blkio.empty_time
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          This is the amount of time a cgroup spends without any pending
          requests when not being served, i.e., it does not include any time
          spent idling for one of the queues of the cgroup. This is in
          nanoseconds. If this is read when the cgroup is in an empty state,
          the stat will only report the empty_time accumulated till the last
          time it had a pending request and will not include the current
          delta.

- blkio.idle_time
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          This is the amount of time spent by the IO scheduler idling for a
          given cgroup in anticipation of a better request than the existing
          ones from other queues/cgroups. This is in nanoseconds. If this is
          read when the cgroup is in an idling state, the stat will only
          report the idle_time accumulated till the last idle period and will
          not include the current delta.

- blkio.dequeue
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
          gives statistics about how many times a group was dequeued from
          the service tree of the device. The first two fields specify the
          major and minor number of the device and the third field specifies
          the number of times a group was dequeued from a particular device.

- blkio.*_recursive
        - Recursive versions of various stats. These files show the
          same information as their non-recursive counterparts but
          include stats from all the descendant cgroups.

Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
        - Specifies upper limit on READ rate from the device. IO rate is
          specified in bytes per second. Rules are per device. Following is
          the format.

          echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

- blkio.throttle.write_bps_device
        - Specifies upper limit on WRITE rate to the device. IO rate is
          specified in bytes per second. Rules are per device. Following is
          the format.

          echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

- blkio.throttle.read_iops_device
        - Specifies upper limit on READ rate from the device. IO rate is
          specified in IOs per second. Rules are per device. Following is
          the format.

          echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

- blkio.throttle.write_iops_device
        - Specifies upper limit on WRITE rate to the device. IO rate is
          specified in IOs per second. Rules are per device. Following is
          the format.

          echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Note: If both BW and IOPS rules are specified for a device, then IO is
      subject to both constraints.
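
      For example, assuming the device 8:16 used in the earlier examples, a
      bandwidth limit and an IOPS limit could be combined on reads as below
      (the values are illustrative); reads are then capped by whichever
      limit is reached first.

        echo "8:16 10485760" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
        echo "8:16 1000" > /sys/fs/cgroup/blkio/blkio.throttle.read_iops_device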

- blkio.throttle.io_serviced
        - Number of IOs (bios) completed to/from the disk by the group (as
          seen by the throttling policy). These are further divided by the
          type of operation - read or write, sync or async. The first two
          fields specify the major and minor number of the device, the third
          field specifies the operation type and the fourth field specifies
          the number of IOs.

          blkio.io_serviced does accounting as seen by CFQ and counts are in
          number of requests (struct request). On the other hand,
          blkio.throttle.io_serviced counts the number of IOs in terms of the
          number of bios as seen by the throttling policy. These bios can
          later be merged by the elevator, so the total number of requests
          completed can be smaller.

- blkio.throttle.io_service_bytes
        - Number of bytes transferred to/from the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. The first two fields specify the major and minor number
          of the device, the third field specifies the operation type and
          the fourth field specifies the number of bytes.

          These numbers should be roughly the same as blkio.io_service_bytes
          as updated by CFQ. The difference between the two is that
          blkio.io_service_bytes will not be updated if CFQ is not operating
          on the request queue.

Common files among various policies
-----------------------------------
- blkio.reset_stats
        - Writing an int to this file will result in resetting all the stats
          for that cgroup.

CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/slice_idle
------------------------------------------
On faster hardware CFQ can be slow, especially with sequential workloads.
This happens because CFQ idles on a single queue, and a single queue might
not drive deep enough request queue depths to keep the storage busy. In such
scenarios one can try setting slice_idle=0; that switches CFQ to IOPS
(IO operations per second) mode on NCQ supporting hardware.

That means CFQ will not idle between cfq queues of a cfq group and hence will
be able to drive higher queue depths and achieve better throughput. That also
means that cfq provides fairness among groups in terms of IOPS and not in
terms of disk time.

/sys/block/<disk>/queue/iosched/group_idle
------------------------------------------
If one disables idling on individual cfq queues and cfq service trees by
setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
on the group in an attempt to provide fairness among groups.

By default group_idle is the same as slice_idle and does not do anything if
slice_idle is enabled.

One can experience an overall throughput drop if multiple groups have been
created and the applications placed in those groups do not drive enough IO
to keep the disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.

What works
==========
- Currently only sync IO queues are supported. All the buffered writes are
  still system wide and not per group. Hence we will not see service
  differentiation between the buffered writes of different groups.