cgroup-v2.rst - OpenGrok cross reference for /Documentation/admin-guide/cgroup-v2.rst

1 .. _cgroup-v2:
10 This is the authoritative documentation on the design, interface and
11 conventions of cgroup v2.  It describes all userland-visible aspects
12 of cgroup including core and specific controller behaviors.  All
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19      1-1. Terminology
20      1-2. What is cgroup?
22      2-1. Mounting
23      2-2. Organizing Processes and Threads
24        2-2-1. Processes
25        2-2-2. Threads
26      2-3. [Un]populated Notification
27      2-4. Controlling Controllers
28        2-4-1. Enabling and Disabling
29        2-4-2. Top-down Constraint
30        2-4-3. No Internal Process Constraint
31      2-5. Delegation
32        2-5-1. Model of Delegation
33        2-5-2. Delegation Containment
34      2-6. Guidelines
35        2-6-1. Organize Once and Control
36        2-6-2. Avoid Name Collisions
38      3-1. Weights
39      3-2. Limits
40      3-3. Protections
41      3-4. Allocations
43      4-1. Format
44      4-2. Conventions
45      4-3. Core Interface Files
47      5-1. CPU
48        5-1-1. CPU Interface Files
49      5-2. Memory
50        5-2-1. Memory Interface Files
51        5-2-2. Usage Guidelines
52        5-2-3. Memory Ownership
53      5-3. IO
54        5-3-1. IO Interface Files
55        5-3-2. Writeback
56        5-3-3. IO Latency
57          5-3-3-1. How IO Latency Throttling Works
58          5-3-3-2. IO Latency Interface Files
59        5-3-4. IO Priority
60      5-4. PID
61        5-4-1. PID Interface Files
62      5-5. Cpuset
63        5-5-1. Cpuset Interface Files
64      5-6. Device
65      5-7. RDMA
66        5-7-1. RDMA Interface Files
67      5-8. HugeTLB
68        5-8-1. HugeTLB Interface Files
69      5-9. Misc
70        5-9-1. Miscellaneous cgroup Interface Files
71        5-9-2. Migration and Ownership
72      5-10. Others
73        5-10-1. perf_event
74      5-N. Non-normative information
75        5-N-1. CPU controller root cgroup process behaviour
76        5-N-2. IO controller root cgroup process behaviour
78      6-1. Basics
79      6-2. The Root and Views
80      6-3. Migration and setns(2)
81      6-4. Interaction with Other Namespaces
83      P-1. Filesystem Support for Writeback
85    R. Issues with v1 and Rationales for v2
86      R-1. Multiple Hierarchies
87      R-2. Thread Granularity
88      R-3. Competition Between Inner Nodes and Threads
89      R-4. Other Interface Issues
90      R-5. Controller Issues and Remedies
91        R-5-1. Memory
98 -----------
100 "cgroup" stands for "control group" and is never capitalized.  The
101 singular form is used to designate the whole feature and also as a
107 ---------------
109 cgroup is a mechanism to organize processes hierarchically and
110 distribute system resources along the hierarchy in a controlled and
113 cgroup is largely composed of two parts - the core and controllers.
120 cgroups form a tree structure and every process in the system belongs
121 to one and only one cgroup.  All threads of a process belong to the
129 hierarchical - if a controller is enabled on a cgroup, it affects all
131 sub-hierarchy of the cgroup.  When a controller is enabled on a nested
141 --------
146   # mount -t cgroup2 none $MOUNT_POINT
149 controllers which support v2 and are not bound to a v1 hierarchy are
150 automatically bound to the v2 hierarchy and show up at the root.
156 is no longer referenced in its current hierarchy.  Because per-cgroup
157 controller states are destroyed asynchronously and controllers may
161 the unified hierarchy and it may take some time for the disabled
163 to inter-controller dependencies, other controllers may need to be
166 While useful for development and manual configurations, moving
167 controllers dynamically between the v2 and other hierarchies is
169 the hierarchies and controller associations before starting to use the
173 automount the v1 cgroup filesystem and so hijack all controllers
175 and experimenting easier, the kernel parameter cgroup_no_v1= allows
176 disabling controllers in v1 and making them always available in v2.
182 	option is system wide and can only be set on mount or modified
184 	ignored on non-init namespace mounts.  Please refer to the
189         task migrations and controller on/offs at the cost of making
190         hot path operations such as forks and exits more expensive.
192         controllers, and then seeding it with CLONE_INTO_CGROUP is
197         and not any subtrees. This is legacy behaviour, the default
199         This option is system wide and can only be set on mount or
201         option is ignored on non-init namespace mounts.
204         Recursively apply memory.min and memory.low protection to
209         behavior but is a mount-option to avoid regressing setups
216         statistics reporting and memory protection). This is a new
223           controller. The pre-allocated pool does not belong to anyone.
233           still has pages available (but the cgroup limit is hit and
236           memory protection and reclaim dynamics. Any userspace tuning
243         The option restores v1-like behavior of pids.events:max, that is only
250 Organizing Processes and Threads
251 --------------------------------
257 A child cgroup can be created by creating a sub-directory::
262 structure.  Each cgroup has a read-writable interface file
264 belong to the cgroup one-per-line.  The PIDs are not ordered and the
266 another cgroup and then back or the PID got recycled while reading.
278 zombie process does not appear in "cgroup.procs" and thus can't be
283 have any children and is associated only with zombie processes is
284 considered empty and can be removed::
295   0::/test-cgroup/test-cgroup-nested
297 If the process becomes a zombie and the cgroup it was associated with
302   0::/test-cgroup/test-cgroup-nested (deleted)
323 threaded, is called threaded domain or thread root interchangeably and
327 different cgroups and are not subject to the no internal process
328 constraint - threaded controllers can be enabled on non-leaf cgroups
333 resource consumptions whether there are processes in it or not and
336 serve both as a threaded domain and a parent to domain cgroups.
343 On creation, a cgroup is always a domain cgroup and can be made
352 - As the cgroup will join the parent's resource domain.  The parent
355 - When the parent is an unthreaded domain, it must not have any domain
359 Topology-wise, a cgroup can be in an invalid state.  Please consider
362   A (threaded domain) - B (threaded) - C (domain, just created)
377 threads in the cgroup.  Except that the operations are per-thread
378 instead of per-process, "cgroup.threads" has the same format and
385 subtree, and, while the threads can be scattered across the subtree,
388 processes in the subtree and is not readable in the subtree proper.
394 accounts for and controls resource consumptions associated with the
395 threads in the cgroup and its descendants.  All consumptions which
400 between threads in a non-leaf cgroup and its child cgroups.  Each
403 Currently, the following controllers are threaded and can be enabled
406 - cpu
407 - cpuset
408 - perf_event
409 - pids
412 --------------------------
414 Each non-root cgroup has a "cgroup.events" file which contains
415 "populated" field indicating whether the cgroup's sub-hierarchy has
417 the cgroup and its descendants; otherwise, 1.  poll and [id]notify
419 example, to start a clean-up operation after all processes of a given
420 sub-hierarchy have exited.  The populated state updates and
421 notifications are recursive.  Consider the following sub-hierarchy
425   A(4) - B(0) - C(1)
428 A, B and C's "populated" fields would be 1 while D's 0.  After the one
429 process in C exits, B and C's "populated" fields would flip to "0" and
435 -----------------------
437 Enabling and Disabling
446 No controller is enabled by default.  Controllers can be enabled and
449   # echo "+cpu +memory -io" > cgroup.subtree_control
458 Consider the following sub-hierarchy.  The enabled controllers are
461   A(cpu,memory) - B(memory) - C()
464 As A has "cpu" and "memory" enabled, A will control the distribution
465 of CPU cycles and memory to its children, in this case, B.  As B has
466 "memory" enabled but not "CPU", C and D will compete freely on CPU
472 would create the "cpu." prefixed controller interface files in C and
474 prefixed controller interface files from C and D.  This means that the
475 controller interface files - anything which doesn't start with
479 Top-down Constraint
482 Resources are distributed top-down and a cgroup can further distribute
484 parent.  This means that all non-root "cgroup.subtree_control" files
487 the parent has the controller enabled and a controller can't be
494 Non-root cgroups can distribute domain resources to their children
505 processes and anonymous resource consumption which can't be associated
506 with any other cgroups and requires special treatment from most
509 refer to the Non-normative information section in the Controllers
516 cgroup must create children and transfer all its processes to the
522 ----------
528 user by granting write access of the directory and its "cgroup.procs",
529 "cgroup.threads" and "cgroup.subtree_control" files to the user.
538 of at least mount namespacing, and the kernel rejects writes to all
544 delegated, the user can build sub-hierarchy under the directory,
545 organize processes inside it as it sees fit and further distribute the
546 resources it received from the parent.  The limits and other settings
547 of all resource controllers are hierarchical and regardless of what
548 happens in the delegated sub-hierarchy, nothing can escape the
552 cgroups in or nesting depth of a delegated sub-hierarchy; however,
559 A delegated sub-hierarchy is contained in the sense that processes
560 can't be moved into or out of the sub-hierarchy by the delegatee.
563 requiring the following conditions for a process with a non-root euid
567 - The writer must have write access to the "cgroup.procs" file.
569 - The writer must have write access to the "cgroup.procs" file of the
570   common ancestor of the source and destination cgroups.
573 processes around freely in the delegated sub-hierarchy it can't pull
574 in from or push out to outside the sub-hierarchy.
576 For example, let's assume cgroups C0 and C1 have been delegated to
577 user U0 who created C00, C01 under C0 and C10 under C1 as follows and
578 all processes under C0 and C1 belong to U0::
580   ~~~~~~~~~~~~~ - C0 - C00
583   ~~~~~~~~~~~~~ - C1 - C10
587 file; however, the common ancestor of the source cgroup C10 and the
588 destination cgroup C00 is above the points of delegation and U0 would
589 not have write access to its "cgroup.procs" files and thus the write
590 will be denied with -EACCES.
593 that both the source and destination cgroups are reachable from the
595 is not reachable, the migration is rejected with -ENOENT.
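
The common-ancestor rule above can be sketched roughly as follows. This is only an illustration of the path logic, assuming cgroup paths are written relative to the cgroup2 mount root; the helper name ``common_ancestor`` is hypothetical and the snippet does not perform the kernel's actual permission checks.

```python
import posixpath


def common_ancestor(src, dst):
    """Return the deepest cgroup path that contains both src and dst.

    Hypothetical helper: paths are assumed to be absolute and relative
    to the cgroup2 mount root (e.g. "/C0/C00").
    """
    return posixpath.commonpath([src, dst])


# Layout from the example: C0 and C1 are delegated to U0.
# Moving between C00 and C01 only involves C0, which U0 can write to.
print(common_ancestor("/C0/C00", "/C0/C01"))

# Moving from C10 to C00 involves a common ancestor above the points
# of delegation, where U0 has no write access to "cgroup.procs".
print(common_ancestor("/C1/C10", "/C0/C00"))
```

In the second case the common ancestor lies outside anything delegated to U0, which is why the write is denied.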
599 ----------
601 Organize Once and Control
605 and stateful resources such as memory are not moved together with the
607 inherent trade-offs between migration and various hot paths in terms
612 should be assigned to a cgroup according to the system's logical and
613 resource structure once on start-up.  Dynamic adjustments to resource
621 Interface files for a cgroup and its children cgroups occupy the same
622 directory and it is possible to create children cgroups which collide
625 All cgroup core interface files are prefixed with "cgroup." and each
626 controller's interface files are prefixed with the controller name and
627 a dot.  A controller's name is composed of lower case alphabets and
633 cgroup doesn't do anything to prevent name collisions and it's the
641 depending on the resource type and expected use cases.  This section
646 -------
649 active children and giving each the fraction matching the ratio of its
652 work-conserving.  Due to the dynamic nature, this model is usually
660 valid and there is no reason to reject configuration changes or
664 and is an example of this type.
667 .. _cgroupv2-limits-distributor:
670 ------
673 Limits can be over-committed - the sum of the limits of children can
676 Limits are in the range [0, max] and default to "max", which is a no-op.
678 As limits can be over-committed, all configuration combinations are
679 valid and there is no reason to reject configuration changes or
682 "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
683 on an IO device and is an example of this type.
685 .. _cgroupv2-protections-distributor:
688 -----------
693 soft boundaries.  Protections can also be over-committed in which case
697 Protections are in the range [0, max] and default to 0, which is
700 As protections can be over-committed, all configuration combinations
701 are valid and there is no reason to reject configuration changes or
704 "memory.low" implements best-effort memory protection and is an
709 -----------
712 resource.  Allocations can't be over-committed - the sum of the
716 Allocations are in the range [0, max] and default to 0, which is no
719 As allocations can't be over-committed, some configuration
720 combinations are invalid and should be rejected.  Also, if the
724 "cpu.rt.max" hard-allocates realtime slices and is an example of this
732 ------
737   New-line separated values
745   (when read-only or multiple values can be written at once)
765 For both flat and nested keyed files, only the values for a single key
767 may be specified in any order and not all pairs have to be specified.
771 -----------
773 - Settings for a single feature should be contained in a single file.
775 - The root cgroup should be exempt from resource control and thus
778 - The default time unit is microseconds.  If a different unit is ever
781 - A parts-per quantity should use a percentage decimal with at least
782   two digit fractional part - e.g. 13.40.
784 - If a controller implements weight based resource distribution, its
785   interface file should be named "weight" and have the range [1,
787   enough and symmetric bias in both directions while keeping it
790 - If a controller implements an absolute resource guarantee and/or
791   limit, the interface files should be named "min" and "max"
793   guarantee and/or limit, the interface files should be named "low"
794   and "high" respectively.
797   used to represent upward infinity for both reading and writing.
799 - If a setting has a configurable default value and keyed specific
800   overrides, the default entry should be keyed with "default" and
813     # cat cgroup-example-interface-file
819     # echo 125 > cgroup-example-interface-file
823     # echo "default 125" > cgroup-example-interface-file
827     # echo "8:16 170" > cgroup-example-interface-file
829   and cleared by::
831     # echo "8:0 default" > cgroup-example-interface-file
832     # cat cgroup-example-interface-file
836 - For events which are not very high frequency, an interface file
843 --------------------
848 	A read-write single value file which exists on non-root
854 	- "domain" : A normal valid domain cgroup.
856 	- "domain threaded" : A threaded domain cgroup which is
859 	- "domain invalid" : A cgroup which is in an invalid state.
863 	- "threaded" : A threaded cgroup which is a member of a
870 	A read-write new-line separated values file which exists on
874 	the cgroup one-per-line.  The PIDs are not ordered and the
876 	to another cgroup and then back or the PID got recycled while
883 	- It must have write access to the "cgroup.procs" file.
885 	- It must have write access to the "cgroup.procs" file of the
886 	  common ancestor of the source and destination cgroups.
888 	When delegating a sub-hierarchy, write access to this file
893 	supported and moves every thread of the process to the cgroup.
896 	A read-write new-line separated values file which exists on
900 	the cgroup one-per-line.  The TIDs are not ordered and the
902 	another cgroup and then back or the TID got recycled while
909 	- It must have write access to the "cgroup.threads" file.
911 	- The cgroup that the thread is currently in must be in the
914 	- It must have write access to the "cgroup.procs" file of the
915 	  common ancestor of the source and destination cgroups.
917 	When delegating a sub-hierarchy, write access to this file
921 	A read-only space separated values file which exists on all
928 	A read-write space separated values file which exists on all
935 	Space separated list of controllers prefixed with '+' or '-'
937 	name prefixed with '+' enables the controller and '-'
939 	the last one is effective.  When multiple enable and disable
943 	A read-only flat-keyed file which exists on non-root cgroups.
955 	A read-write single value file.  The default is "max".
962 	A read-write single value file.  The default is "max".
969 	A read-only flat-keyed file with the following entries:
988 		cgroup) at and beneath the current cgroup.
992 		cgroup) at and beneath the current cgroup.
995 	A read-only flat-keyed file which exists in non-root cgroups.
999 		Cumulative time that this cgroup has spent between freezing and
1010 		the duration being measured is the span between a and c.
1013 	A read-write single value file which exists on non-root cgroups.
1014 	Allowed values are "0" and "1". The default is "0".
1016 	Writing "1" to the file causes freezing of the cgroup and all
1018 	be stopped and will not run until the cgroup is explicitly
1021 	will be updated to "1" and the corresponding notification will be
1029 	They also can enter and leave a frozen cgroup: either by an explicit
1035 	it's possible to delete a frozen (and empty) cgroup, as well as
1036 	create new sub-cgroups.
1039 	A write-only single value file which exists in non-root cgroups.
1042 	Writing "1" to the file causes the cgroup and all descendant cgroups to
1046 	Killing a cgroup tree will deal with concurrent forks appropriately and
1051 	the whole thread-group.
1054 	A read-write single value file whose allowed values are "0" and "1".
1058 	Writing "1" to the file will re-enable the cgroup PSI accounting.
1062 	and doesn't need to pass enablement via ancestors from root.
1065 	each cgroup separately and aggregates it at each level of the hierarchy.
1066 	This may cause non-negligible overhead for some workloads when under
1068 	be used to disable PSI accounting in the non-leaf cgroups.
1071 	A read-write nested-keyed file.
1079 .. _cgroup-v2-cpu:
1082 ---
1085 controller implements weight and absolute bandwidth limit models for
1086 normal scheduling policy and absolute bandwidth allocation model for
1090 base and it does not account for the frequency at which tasks are executed.
1102 cgroups during the system boot process, and these processes may need
1113 	A read-only flat-keyed file.
1118 	- usage_usec
1119 	- user_usec
1120 	- system_usec
1122 	and the following five when the controller is enabled:
1124 	- nr_periods
1125 	- nr_throttled
1126 	- throttled_usec
1127 	- nr_bursts
1128 	- burst_usec
1131 	A read-write single value file which exists on non-root
1141 	A read-write single value file which exists on non-root
1144 	The nice value is in the range [-20, 19].
1147 	"cpu.weight" and allows reading and setting weight using the
1148 	same values used by nice(2).  Because the range is smaller and
1153 	A read-write two value file which exists on non-root cgroups.
1165 	A read-write single value file which exists on non-root
1171 	A read-write nested-keyed file.
1177         A read-write single value file which exists on non-root cgroups.
1183         This interface allows reading and setting minimum utilization clamp
1192         A read-write single value file which exists on non-root cgroups.
1198         This interface allows reading and setting maximum utilization clamp
1203 	A read-write single value file which exists on non-root cgroups.
1206 	This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1215 ------
1218 stateful and implements both limit and protection models.  Due to the
1219 intertwining between memory usage and reclaim pressure and the
1223 While not completely water-tight, all major memory usages by a given
1225 accounted and controlled to a reasonable extent.  Currently, the
1228 - Userland memory - page cache and anonymous memory.
1230 - Kernel data structures such as dentries and inodes.
1232 - TCP socket buffers.
1245 	A read-only single value file which exists on non-root
1249 	and its descendants.
1252 	A read-write single value file which exists on non-root
1272 	protection is discouraged and may lead to constant OOMs.
1278 	A read-write single value file which exists on non-root
1281 	Best-effort memory protection.  If the memory usage of a
1301 	A read-write single value file which exists on non-root
1306 	throttled and put under heavy reclaim pressure.
1308 	Going over the high limit never invokes the OOM killer and
1315 	A read-write single value file which exists on non-root
1320 	this limit and can't be reduced, the OOM killer is invoked in
1324 	In default configuration regular 0-order allocations always
1329 	as -ENOMEM or silently ignore in cases like disk readahead.
1332 	A write-only nested-keyed file which exists for all cgroups.
1343 	specified amount, -EAGAIN is returned.
1361 	all the existing limitations and potential future extensions.
1364 	A read-write single value file which exists on non-root cgroups.
1366 	The max memory usage recorded for the cgroup and its descendants since
1369 	A write of any non-empty string to this file resets it to the
1374 	A read-write single value file which exists on non-root
1384 	Tasks with the OOM protection (oom_score_adj set to -1000)
1385 	are treated as an exception and are never killed.
1392 	A read-only flat-keyed file which exists on non-root cgroups.
1397 	Note that all fields in this file are hierarchical and the
1406 		boundary is over-committed.
1410 		throttled and routed to perform direct memory reclaim
1423 		reached the limit and allocation was about to fail.
1426 		considered as an option, e.g. for failed high-order
1442 	A read-only flat-keyed file which exists on non-root cgroups.
1445 	types of memory, type-specific details, and other information
1446 	on the state and past events of the memory management system.
1450 	The entries are ordered to be human readable, and new entries
1454 	If an entry has no per-node counter (or is not shown in
1455 	memory.numa_stat), we use 'npn' (non-per-node) as the tag
1460 		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
1464 		including tmpfs and shared memory.
1480 		and arm64 and IOMMU page tables.
1483 		Amount of memory used for storing per-cpu kernel
1493 		Amount of cached filesystem data that is swap-backed,
1510 		Amount of cached filesystem data that was modified and
1514 		Amount of swap cached in memory. The swapcache is accounted
1515 		against both memory and swap usage.
1530 		Amount of memory, swap-backed and filesystem-backed,
1536 		the value for the foo counter, since the foo counter is type-based, not
1537 		list-based.
1541 		dentries and inodes.
1548 		Amount of memory used for storing in-kernel data
1621 		Number of pages swapped into memory and filled with zero, where I/O
1626 		Number of zero-filled pages swapped out with I/O skipped due to the
1636 		Number of pages written from zswap to swap.
1654 		Usually because failed to allocate some continuous swap space
1677 	A read-only nested-keyed file which exists on non-root cgroups.
1680 	types of memory, type-specific details, and other information
1695 	The entries are ordered to be human readable, and new entries
1701   memory.swap.current
1702 	A read-only single value file which exists on non-root
1705 	The total amount of swap currently being used by the cgroup
1706 	and its descendants.
1708   memory.swap.high
1709 	A read-write single value file which exists on non-root
1712 	Swap usage throttle limit.  If a cgroup's swap usage exceeds
1714 	allow userspace to implement custom out-of-memory procedures.
1718 	during regular operation. Compare to memory.swap.max, which
1724   memory.swap.peak
1725 	A read-write single value file which exists on non-root cgroups.
1727 	The max swap usage recorded for the cgroup and its descendants since
1730 	A write of any non-empty string to this file resets it to the
1734   memory.swap.max
1735 	A read-write single value file which exists on non-root
1738 	Swap usage hard limit.  If a cgroup's swap usage reaches this
1741   memory.swap.events
1742 	A read-only flat-keyed file which exists on non-root cgroups.
1748 		The number of times the cgroup's swap usage was over
1752 		The number of times the cgroup's swap usage was about
1753 		to go over the max boundary and swap allocation
1757 		The number of times swap allocation failed either
1758 		because of running out of swap system-wide or max
1761 	When reduced under the current usage, the existing swap
1762 	entries are reclaimed gradually and the swap usage may stay
1764 	reduces the impact on the workload and memory management.
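
As a minimal sketch of how a flat-keyed file such as memory.swap.events might be consumed from userspace, the following parses "KEY VALUE" lines into a dictionary. The function name ``parse_flat_keyed`` and the sample contents are made up for illustration; the keys "high", "max" and "fail" are assumed to correspond to the three entries described above.

```python
def parse_flat_keyed(text):
    """Parse flat-keyed content ("KEY VALUE" per line) into a dict of ints."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        result[key] = int(value)
    return result


# Hypothetical sample; real values would be read from the cgroup's
# memory.swap.events file under the cgroup2 mount point.
sample = "high 0\nmax 3\nfail 1\n"
events = parse_flat_keyed(sample)
print(events["max"])
```

A monitoring tool could poll the file, re-parse it, and compare counters against the previous reading to detect new events.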
1767 	A read-only single value file which exists on non-root
1774 	A read-write single value file which exists on non-root
1782 	A read-write single value file. The default value is "1".
1788 	are disabled. This includes both zswap writebacks and swapping due
1792 	pages might be rejected again and again).
1794 	Note that this is subtly different from setting memory.swap.max to
1796 	This setting has no effect if zswap is disabled, and swapping
1797 	is allowed unless memory.swap.max is set to 0.
1800 	A read-only nested-keyed file.
1810 Over-committing on high limit (sum of high limits > available memory)
1811 and letting global memory pressure distribute memory according to
1816 opportunities to monitor and take appropriate actions such as granting
1824 pressure - how much the workload is being impacted due to lack of
1825 memory - is necessary to determine whether a workload needs more
1833 A memory area is charged to the cgroup which instantiated it and stays
1839 To which cgroup the area will be charged is non-deterministic; however,
1850 --
1853 controller implements both weight based and absolute bandwidth or IOPS
1855 only if cfq-iosched is in use and neither scheme is available for
1856 blk-mq devices.
1863 	A read-only nested-keyed file.
1865 	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
1883 	A read-write nested-keyed file which exists only on the root
1889 	are keyed by $MAJ:$MIN device numbers and not ordered.  The
1895 	  enable	Weight-based control enable
1905 	The controller is disabled by default and can be enabled by
1906 	setting "enable" to 1.  "rpct" and "wpct" parameters default
1907 	to zero and the controller uses internal device saturation
1908 	state to adjust the overall IO rate between "min" and "max".
1917 	latencies is above 75ms or write 150ms, and adjust the overall
1918 	IO issue rate between 50% and 150% accordingly.
1922 	adjustment range between "min" and "max", the more conformant
1924 	base rate may be far off from 100% and setting "min" and "max"
1926 	control quality.  "min" and "max" are useful for regulating
1927 	devices which show wide temporary behavior changes - e.g. a
1928 	ssd which accepts writes at the line speed for a while and
1932 	kernel and may change automatically.  Setting "ctrl" to "user"
1933 	or setting any of the percentile and latency parameters puts
1934 	it into "user" mode and disables the automatic changes.  The
1938 	A read-write nested-keyed file which exists only on the root
1944 	by $MAJ:$MIN device numbers and not ordered.  The line for a
1951 	  model		The cost model in use - "linear"
1956 	parameters are written to, "ctrl" becomes "user" and the
1969 	costs of a sequential and random IO and the cost coefficient
1974 	sense and is scaled to the device behavior dynamically.
1977 	generate device-specific coefficients.
1980 	A read-write flat-keyed file which exists on non-root cgroups.
1985 	$MAJ:$MIN device numbers and not ordered.  The weights are in
1986 	the range [1, 10000] and specify the relative amount of IO time
1991 	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
2000 	A read-write nested-keyed file which exists on non-root
2003 	BPS and IOPS based IO limit.  Lines are keyed by $MAJ:$MIN
2004 	device numbers and not ordered.  The following nested keys are
2014 	When writing, any number of nested key-value pairs can be
2019 	BPS and IOPS are measured in each IO direction and IOs are
2022 	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
2039 	A read-only nested-keyed file.
2048 Page cache is dirtied through buffered writes and shared mmaps and
2050 mechanism.  Writeback sits between the memory and IO domains and
2051 regulates the proportion of dirty memory by balancing dirtying and
2056 defines the memory domain that dirty memory ratio is calculated and
2057 maintained for and the io controller defines the io domain which
2058 writes out dirty pages for the memory domain.  Both system-wide and
2059 per-cgroup dirty memory states are examined and the more restrictive
2064 btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are
2067 There are inherent differences in memory and writeback management
2070 inode is assigned to a cgroup and all IO requests to write dirty pages
2076 constantly keeps track of foreign pages and, if a particular foreign
2085 As memory controller assigns page ownership on the first use and
2097 	memory controller and system-wide clean memory.
2101 	total available memory and applied the same way as
2109 with a latency target, and if the average latency exceeds that target the
2114 in the diagram below, only groups A, B, and C will influence each other, and
2115 groups D and F will influence each other.  Group G will influence nobody::
2124 So the ideal way to configure this is to set io.latency in groups A, B, and C.
2127 Start at higher than the expected latency for your device and watch the
2130 your real setting, setting at 10-15% higher than the value in io.stat.
2140 - Queue depth throttling.  This is the number of outstanding IO's a group is
2142   and going all the way down to 1 IO at a time.
2144 - Artificial delay induction.  There are certain types of IO that cannot be
2146   includes swapping and metadata IO.  These types of IO are allowed to occur
2148   originating group is being throttled you will see the use_delay and delay
2191   no-change
2194   promote-to-rt
2195 	For requests that have a non-RT I/O priority class, change it into RT.
2199   restrict-to-be
2209   none-to-rt
2210 	Deprecated. Just an alias for promote-to-rt.
2214 +----------------+---+
2215 | no-change      | 0 |
2216 +----------------+---+
2217 | promote-to-rt  | 1 |
2218 +----------------+---+
2219 | restrict-to-be | 2 |
2220 +----------------+---+
2222 +----------------+---+
2226 +-------------------------------+---+
2228 +-------------------------------+---+
2229 | IOPRIO_CLASS_RT (real-time)   | 1 |
2230 +-------------------------------+---+
2232 +-------------------------------+---+
2234 +-------------------------------+---+
2238 - If I/O priority class policy is promote-to-rt, change the request I/O
2239   priority class to IOPRIO_CLASS_RT and change the request I/O priority
2241 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2243   into the maximum of the I/O priority class policy number and the numerical
2247 ---
2266 	A read-write single value file which exists on non-root
2272 	A read-only single value file which exists on non-root cgroups.
2274 	The number of processes currently in the cgroup and its
2278 	A read-only single value file which exists on non-root cgroups.
2280 	The maximum value that the number of processes in the cgroup and its
2284 	A read-only flat-keyed file which exists on non-root cgroups. Unless
2302 through fork() or clone(). These will return -EAGAIN if the creation
2307 ------
2310 the CPU and memory node placement of tasks to only the resources
2313 on properly sized subsets of the systems with careful processor and
2314 memory placement to reduce cross-node memory access and contention
2325 	A read-write multiple values file which exists on non-root
2326 	cpuset-enabled cgroups.
2330 	subjected to constraints imposed by its parent and can differ
2333 	The CPU numbers are comma-separated numbers or ranges.
2337 	  0-4,6,8-10
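A minimal sketch of how a userspace tool might expand such a list into individual CPU numbers is shown below. The function name is hypothetical and the parser only assumes the comma-separated numbers-or-ranges format described above. It is not taken from any kernel or library code.

```python
def parse_id_list(text: str) -> list[int]:
    """Expand a list such as "0-4,6,8-10" into sorted, unique integers."""
    ids: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            # An empty string (or stray comma) contributes nothing.
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            low, high = int(first), int(last)
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            ids.update(range(low, high + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


if __name__ == "__main__":
    print(parse_id_list("0-4,6,8-10"))
```

The same helper also covers memory node lists, which use the same numbers-or-ranges format.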
2340 	setting as the nearest cgroup ancestor with a non-empty
2344 	and won't be affected by any CPU hotplug events.
2347 	A read-only multiple values file which exists on all
2348 	cpuset-enabled cgroups.
2364 	A read-write multiple values file which exists on non-root
2365 	cpuset-enabled cgroups.
2369 	is subjected to constraints imposed by its parent and can differ
2372 	The memory node numbers are comma-separated numbers or ranges.
2376 	  0-1,3
2379 	setting as the nearest cgroup ancestor with a non-empty
2384 	and won't be affected by any memory node hotplug events.
2386 	Setting a non-empty value to "cpuset.mems" causes memory of
2391 	may not be complete and some memory pages may be left behind.
2398 	A read-only multiple values file which exists on all
2399 	cpuset-enabled cgroups.
2414 	A read-write multiple values file which exists on non-root
2415 	cpuset-enabled cgroups.
2444 	The root cgroup is a partition root and all its available CPUs
2448 	A read-only multiple values file which exists on all non-root
2449 	cpuset-enabled cgroups.
2461 	A read-only and root cgroup only multiple values file.
2468 	A read-write single value file which exists on non-root
2469 	cpuset-enabled cgroups.  This flag is owned by the parent cgroup
2470 	and is not delegatable.
2475 	  "member"	Non-root member of a partition
2480 	A cpuset partition is a collection of cpuset-enabled cgroups with
2481 	a partition root at the top of the hierarchy and its descendants
2482 	except those that are separate partition roots themselves and
2487 	There are two types of partitions - local and remote.  A local
2502 	The root cgroup is always a partition root and its state cannot
2503 	be changed.  All other non-root cgroups start out as "member".
2511 	and excluded from the unbound workqueues.  Tasks placed in such
2513 	and bound to each of the individual CPUs for optimal performance.
2516 	two possible states - valid or invalid.  An invalid partition
2520 	All possible state transitions among "member", "root" and
2527 	  "member"			Non-root member of a partition
2551 	become invalid and vice versa.  Note that a task cannot be
2554 	A valid non-root parent partition may distribute out all its CPUs
2565 	Poll and inotify events are triggered whenever the state of
2573 	A user can pre-configure certain CPUs to an isolated state
2580 -----------------
2583 creation of new device files (using mknod), and access to the
2586 Cgroup v2 device controller has no interface files and is implemented
2588 create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach
2590 device file, corresponding BPF programs will be executed, and depending
2591 on the return value the attempt will succeed or fail with -EPERM.
2595 access type (mknod/read/write) and device (type, major and minor numbers).
2596 If the program returns 0, the attempt fails with -EPERM, otherwise it
2604 ----
2606 The "rdma" controller regulates the distribution and accounting of
2613 	A read-write nested-keyed file that exists for all the cgroups
2617 	Lines are keyed by device name and are not ordered.
2618 	Each line contains space-separated resource name and its configured
2628 	An example for mlx4 and ocrdma devices follows::
2634 	A read-only file that describes current resource usage.
2637 	An example for mlx4 and ocrdma devices follows::
2643 -------
2645 The HugeTLB controller allows limiting the HugeTLB usage per control group and
2660 	A read-only flat-keyed file which exists on non-root cgroups.
2673         use hugetlb pages are included.  The per-node values are in bytes.
2676 ----
2678 The Miscellaneous cgroup provides the resource limiting and tracking
2684 include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[]
2688 Once a capacity is set then the resource usage can be updated using charge and
2695 Miscellaneous controller provides 3 interface files. If two misc resources (res_a and res_b) are re…
2698         A read-only flat-keyed file shown only in the root cgroup.  It shows
2707         A read-only flat-keyed file shown in all cgroups.  It shows
2708         the current usage of the resources in the cgroup and its children::
2715         A read-only flat-keyed file shown in all cgroups.  It shows the
2716         historical maximum usage of the resources in the cgroup and its
2724         A read-write flat-keyed file shown in non-root cgroups.  Allowed
2725         maximum usage of the resources in the cgroup and its children::
2743 	A read-only flat-keyed file which exists on non-root cgroups. The
2757 Migration and Ownership
2761 first, and stays charged to that cgroup until that resource is freed. Migrating
2766 ------
2777 Non-normative information
2778 -------------------------
2781 the stable kernel API and so is subject to change.
2794 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2810 ------
2813 "/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
2814 flag can be used with clone(2) and unshare(2) to create a new cgroup
2822 a set of cgroups and namespaces are intended to isolate processes the
2829 The path '/batchjobs/container_id1' can be considered as system-data
2830 and undesirable to expose to the isolated processes.  cgroup namespace
2834   # ls -l /proc/self/ns/cgroup
2835   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2841   # ls -l /proc/self/ns/cgroup
2842   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
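The link targets shown above can also be compared programmatically. The following Python sketch extracts the namespace identifier from a link target of the form shown in the listings and checks whether two targets name the same cgroup namespace. The helper names are hypothetical, and the format is assumed to be exactly "cgroup:[<number>]".

```python
import re

# Matches link targets such as "cgroup:[4026531835]".
_NS_LINK = re.compile(r"^cgroup:\[(\d+)\]$")


def cgroup_ns_id(link_target: str) -> int:
    """Return the numeric identifier embedded in a cgroup namespace link."""
    match = _NS_LINK.match(link_target)
    if match is None:
        raise ValueError(f"not a cgroup namespace link: {link_target!r}")
    return int(match.group(1))


def same_cgroup_ns(target_a: str, target_b: str) -> bool:
    """True if both link targets refer to the same cgroup namespace."""
    return cgroup_ns_id(target_a) == cgroup_ns_id(target_b)


if __name__ == "__main__":
    # The two identifiers from the listings above differ, so this prints False.
    print(same_cgroup_ns("cgroup:[4026531835]", "cgroup:[4026532183]"))
```

On a Linux system the link target for the current process can be obtained with os.readlink("/proc/self/ns/cgroup").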
2846 When some thread from a multi-threaded process unshares its cgroup
2853 namespace is destroyed.  The cgroupns root and the actual cgroups
2857 The Root and Views
2858 ------------------
2869   # ~/unshare -c # unshare cgroupns in some cgroup
2877 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2907 Migration and setns(2)
2908 ----------------------
2910 Processes inside a cgroup namespace can move into and out of the
2913 /batchjobs/container_id1, and assuming that the global hierarchy is
2937 ---------------------------------
2940 running inside a non-init cgroup namespace::
2942   # mount -t cgroup2 none $MOUNT_POINT
2945 filesystem root.  The process needs CAP_SYS_ADMIN against its user and
2949 the view of cgroup hierarchy by namespace-private cgroupfs mount
2957 where interacting with cgroup is necessary.  cgroup core and
2962 --------------------------------
2965 address_space_operations->writepage[s]() to annotate bio's using the
2969 	Should be called for each bio carrying writeback data and
2970 	associates the bio with the inode's owner cgroup and the
2972 	a queue (device) has been associated with the bio and
2978 	during the writeback session, it's the easiest and most
2982 super_block by setting SB_I_CGROUPWB in ->s_iflags.  This allows for
2988 the configuration, the bio may be executed at a lower priority and if
2992 cases by skipping wbc_init_bio() and using bio_associate_blkg()
2999 - Multiple hierarchies including named ones are not supported.
3001 - No v1 mount options are supported.
3003 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3005 - "cgroup.clone_children" is removed.
3007 - /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" or
3011 Issues with v1 and Rationales for v2
3015 --------------------
3017 cgroup v1 allowed an arbitrary number of hierarchies and each
3031 put on the same hierarchy and most configurations resorted to putting
3033 as the cpu and cpuacct controllers, made sense to be put on the same
3041 used in general and what controllers were able to do.
3045 length.  The key might contain any number of entries and was unlimited
3046 in length, which made it highly awkward to manipulate and led to
3068 ------------------
3071 This didn't make sense for some controllers and those controllers
3074 individual applications and system management interface.
3076 Generally, in-process knowledge is available only to the process
3077 itself; thus, unlike service-level organization of processes,
3083 individual applications so that they can create and manage their own
3084 sub-hierarchies and control resource distributions along them.  This
3085 effectively raised cgroup to the status of a syscall-like API exposed
3092 and then read and/or write to it.  This is not only extremely clunky
3093 and unusual but also inherently racy.  There is no conventional way to
3094 define a transaction across the required steps and nothing can guarantee
3095 that the process would actually be operating on its own sub-hierarchy.
3099 system-management pseudo filesystem.  cgroup ended up with interface
3100 knobs which were not properly abstracted or refined and directly
3102 individual applications through the ill-defined delegation mechanism
3106 This was painful for both userland and kernel.  Userland ended up with
3107 misbehaving and poorly abstracted interfaces and kernel exposing and
3111 Competition Between Inner Nodes and Threads
3112 -------------------------------------------
3115 interesting problem where threads belonging to a parent cgroup and its
3117 different types of entities competed and there was no obvious way to
3120 The cpu controller considered threads and cgroups as equivalents and
3123 cycles and the number of internal threads fluctuated - the ratios
3126 wasn't obvious or universal, and there were various other knobs which
3134 otherwise, made the interface messy and significantly complicated the
3138 between internal tasks and child cgroups and the behavior was not
3139 clearly defined.  There were attempts to add ad-hoc behaviors and
3143 Multiple controllers struggled with internal tasks and came up with
3145 severely flawed and, furthermore, the widely different behaviors
3153 ----------------------
3155 cgroup v1 grew without oversight and developed a large number of
3156 idiosyncrasies and inconsistencies.  One issue on the cgroup core side
3157 was how an empty cgroup was notified - a userland helper binary was
3158 forked and executed for each event.  The event delivery wasn't
3160 to an in-kernel event delivery filtering mechanism, further complicating
3164 controllers completely ignoring hierarchical organization and treating
3173 control used widely differing naming schemes and formats.  Statistics
3174 and information knobs were named arbitrarily and used different
3175 formats and units even in the same controller.
3177 cgroup v2 establishes common conventions where appropriate and updates
3178 controllers so that they expose minimal and consistent interfaces.
3181 Controller Issues and Remedies
3182 ------------------------------
3189 global reclaim prefers is opt-in, rather than opt-out.  The costs for
3194 rbtree and treated like equal peers, regardless of where they are located
3199 becomes self-defeating.
3201 The memory.low boundary on the other hand is a top-down allocated
3211 runtime, and that requires users to overcommit.  But doing that with a
3214 estimation is hard and error prone, and getting it wrong results in
3215 OOM kills, most users tend to err on the side of a looser limit and
3224 and make corrections until the minimal memory footprint that still
3227 In extreme cases, with many concurrent allocations and a complete
3232 limit this type of spillover and ultimately contain buggy or even
3238 limit to prevent new charges, and then reclaim and OOM kill until the
3239 new limit is met - or the task writing to memory.max is killed.
3241 The combined memory+swap accounting and limiting is replaced by real
3242 control over swap space.
3244 The main argument for a combined memory+swap facility in the original
3246 able to swap all anonymous memory of a child group, regardless of the
3248 groups can sabotage swapping by other means - such as referencing their
3249 anonymous memory in a tight loop - and an admin cannot assume full
3253 intuitive userspace interface, and it flies in the face of the idea
3254 that cgroup controllers should account and limit specific physical
3255 resources.  Swap space is a resource like all others in the system,
3256 and that's why unified hierarchy allows distributing it separately.