Lines Matching +full:memory +full:- +full:controller
1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
12 of cgroup including core and specific controller behaviors. All
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. HugeTLB
68 5-8-1. HugeTLB Interface Files
69 5-9. Misc
70 5-9-1. Miscellaneous cgroup Interface Files
71 5-9-2. Migration and Ownership
72 5-10. Others
73 5-10-1. perf_event
74 5-N. Non-normative information
75 5-N-1. CPU controller root cgroup process behaviour
76 5-N-2. IO controller root cgroup process behaviour
78 6-1. Basics
79 6-2. The Root and Views
80 6-3. Migration and setns(2)
81 6-4. Interaction with Other Namespaces
83 P-1. Filesystem Support for Writeback
86 R-1. Multiple Hierarchies
87 R-2. Thread Granularity
88 R-3. Competition Between Inner Nodes and Threads
89 R-4. Other Interface Issues
90 R-5. Controller Issues and Remedies
91 R-5-1. Memory
98 -----------
107 ---------------
113 cgroup is largely composed of two parts - the core and controllers.
115 processes. A cgroup controller is usually responsible for
128 disabled selectively on a cgroup. All controller behaviors are
129 hierarchical - if a controller is enabled on a cgroup, it affects all
131 sub-hierarchy of the cgroup. When a controller is enabled on a nested
141 --------
146 # mount -t cgroup2 none $MOUNT_POINT
155 A controller can be moved across hierarchies only after the controller
156 is no longer referenced in its current hierarchy. Because per-cgroup
157 controller states are destroyed asynchronously and controllers may
158 have lingering references, a controller may not show up immediately on
160 Similarly, a controller should be fully disabled to be moved out of
162 controller to become available for other hierarchies; furthermore, due
163 to inter-controller dependencies, other controllers may need to be
169 the hierarchies and controller associations before starting to use the
184 ignored on non-init namespace mounts. Please refer to the
189 task migrations and controller on/offs at the cost of making
196 Only populate memory.events with data for the current cgroup,
201 option is ignored on non-init namespace mounts.
204 Recursively apply memory.min and memory.low protection to
209 behavior but is a mount-option to avoid regressing setups
214 Count HugeTLB memory usage towards the cgroup's overall
215 memory usage for the memory controller (for the purpose of
216 statistics reporting and memory protection). This is a new
222 * There is no HugeTLB pool management involved in the memory
223 controller. The pre-allocated pool does not belong to anyone.
226 memory controller. It is only charged to a cgroup when it is
227 actually used (e.g. at page fault time). Host memory
230 done via other mechanisms (such as the HugeTLB controller).
231 * Failure to charge a HugeTLB folio to the memory controller
235 * Charging HugeTLB memory towards the memory controller affects
236 memory protection and reclaim dynamics. Any userspace tuning
239 will not be tracked by the memory controller (even if cgroup
243 The option restores v1-like behavior of pids.events:max, that is only
251 --------------------------------
257 A child cgroup can be created by creating a sub-directory::
262 structure. Each cgroup has a read-writable interface file
264 belong to the cgroup one-per-line. The PIDs are not ordered and the
295 0::/test-cgroup/test-cgroup-nested
302 0::/test-cgroup/test-cgroup-nested (deleted)
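The two "/proc/$PID/cgroup" entries above can be split mechanically. The following is a minimal sketch, assuming each entry has the form "hierarchy-ID:controller-list:path" and that a removed cgroup is marked by a trailing " (deleted)". The `CgroupEntry` type and `parse_proc_cgroup_line` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Tuple


class CgroupEntry(NamedTuple):
    """One parsed line of /proc/$PID/cgroup (hypothetical helper type)."""
    hierarchy_id: int
    controllers: Tuple[str, ...]
    path: str
    deleted: bool


def parse_proc_cgroup_line(line: str) -> CgroupEntry:
    """Split a "hierarchy-ID:controller-list:path" entry into its parts."""
    # Split at most twice so that a path containing ':' stays intact.
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    marker = " (deleted)"
    deleted = path.endswith(marker)
    if deleted:
        path = path[: -len(marker)]
    # The controller list is empty for the cgroup v2 entry ("0::<path>").
    names = tuple(name for name in controllers.split(",") if name)
    return CgroupEntry(int(hierarchy_id), names, path, deleted)


entry = parse_proc_cgroup_line("0::/test-cgroup/test-cgroup-nested (deleted)")
```

Here `entry.path` holds the cgroup path without the marker and `entry.deleted` records that the cgroup has been removed.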
328 constraint - threaded controllers can be enabled on non-leaf cgroups
352 - As the cgroup will join the parent's resource domain. The parent
355 - When the parent is an unthreaded domain, it must not have any domain
359 Topology-wise, a cgroup can be in an invalid state. Please consider
362 A (threaded domain) - B (threaded) - C (domain, just created)
377 threads in the cgroup. Except that the operations are per-thread
378 instead of per-process, "cgroup.threads" has the same format and
393 a threaded controller is enabled inside a threaded subtree, it only
399 constraint, a threaded controller must be able to handle competition
400 between threads in a non-leaf cgroup and its child cgroups. Each
401 threaded controller defines how such competitions are handled.
406 - cpu
407 - cpuset
408 - perf_event
409 - pids
412 --------------------------
414 Each non-root cgroup has a "cgroup.events" file which contains
415 "populated" field indicating whether the cgroup's sub-hierarchy has
419 example, to start a clean-up operation after all processes of a given
420 sub-hierarchy have exited. The populated state updates and
421 notifications are recursive. Consider the following sub-hierarchy
425 A(4) - B(0) - C(1)
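The recursive nature of the populated state can be illustrated with a small model. This is only a sketch, not how the kernel computes the field: the tree is represented as plain dictionaries, the numbers are live process counts as in the diagram, and the empty sibling "D" is an assumption added for illustration.

```python
def populated(children, live_counts, name):
    """Return 1 if `name` or any of its descendants has live processes."""
    if live_counts.get(name, 0) > 0:
        return 1
    # Otherwise the cgroup is populated only if some child subtree is.
    return int(any(populated(children, live_counts, child)
                   for child in children.get(name, [])))


# A(4) - B(0) - C(1), plus a hypothetical empty sibling D under B.
children = {"A": ["B"], "B": ["C", "D"]}
live_counts = {"A": 4, "B": 0, "C": 1, "D": 0}

state = {name: populated(children, live_counts, name) for name in live_counts}
```

In this model B counts as populated even though it has no processes of its own, because its child C does.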
435 -----------------------
444 cpu io memory
446 No controller is enabled by default. Controllers can be enabled and
449 # echo "+cpu +memory -io" > cgroup.subtree_control
453 all succeed or fail. If multiple operations on the same controller
456 Enabling a controller in a cgroup indicates that the distribution of
458 Consider the following sub-hierarchy. The enabled controllers are
461 A(cpu,memory) - B(memory) - C()
464 As A has "cpu" and "memory" enabled, A will control the distribution
465 of CPU cycles and memory to its children, in this case, B. As B has
466 "memory" enabled but not "cpu", C and D will compete freely on CPU
467 cycles but their division of memory available to B will be controlled.
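The effect of a write such as "+cpu +memory -io" on the set of enabled controllers can be modelled as follows. This is a simplified sketch and not the kernel implementation: `apply_subtree_control` is a hypothetical helper, it does not check whether a controller is actually available, and it assumes that when a controller is named more than once the last operation wins.

```python
def apply_subtree_control(enabled, request):
    """Return the controller set that results from one write.

    The update is computed on a copy, so a malformed token leaves the
    caller's set untouched (all operations succeed or none do).
    """
    result = set(enabled)
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError("malformed token: %r" % token)
        if sign == "+":
            result.add(name)
        else:
            result.discard(name)
    return result


after = apply_subtree_control({"io"}, "+cpu +memory -io")
```

Starting from a cgroup with only "io" enabled, `after` contains "cpu" and "memory".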
469 As a controller regulates the distribution of the target resource to
470 the cgroup's children, enabling it creates the controller's interface
472 would create the "cpu." prefixed controller interface files in C and
473 D. Likewise, disabling "memory" from B would remove the "memory."
474 prefixed controller interface files from C and D. This means that the
475 controller interface files - anything which doesn't start with
479 Top-down Constraint
482 Resources are distributed top-down and a cgroup can further distribute
484 parent. This means that all non-root "cgroup.subtree_control" files
486 "cgroup.subtree_control" file. A controller can be enabled only if
487 the parent has the controller enabled and a controller can't be
494 Non-root cgroups can distribute domain resources to their children
499 This guarantees that, when a domain controller is looking at the part
508 is up to each controller (for more information on this topic please
509 refer to the Non-normative information section in the Controllers
513 enabled controller in the cgroup's "cgroup.subtree_control". This is
522 ----------
544 delegated, the user can build sub-hierarchy under the directory,
548 happens in the delegated sub-hierarchy, nothing can escape the
552 cgroups in or nesting depth of a delegated sub-hierarchy; however,
559 A delegated sub-hierarchy is contained in the sense that processes
560 can't be moved into or out of the sub-hierarchy by the delegatee.
563 requiring the following conditions for a process with a non-root euid
567 - The writer must have write access to the "cgroup.procs" file.
569 - The writer must have write access to the "cgroup.procs" file of the
573 processes around freely in the delegated sub-hierarchy it can't pull
574 in from or push out to outside the sub-hierarchy.
580 ~~~~~~~~~~~~~ - C0 - C00
583 ~~~~~~~~~~~~~ - C1 - C10
590 will be denied with -EACCES.
595 is not reachable, the migration is rejected with -ENOENT.
599 ----------
605 and stateful resources such as memory are not moved together with the
607 inherent trade-offs between migration and various hot paths in terms
613 resource structure once on start-up. Dynamic adjustments to resource
614 distribution can be made by changing controller configuration through
626 controller's interface files are prefixed with the controller name and
627 a dot. A controller's name is composed of lowercase letters and
646 -------
652 work-conserving. Due to the dynamic nature, this model is usually
667 .. _cgroupv2-limits-distributor:
670 ------
673 Limits can be over-committed - the sum of the limits of children can
678 As limits can be over-committed, all configuration combinations are
685 .. _cgroupv2-protections-distributor:
688 -----------
693 soft boundaries. Protections can also be over-committed in which case
700 As protections can be over-committed, all configuration combinations
704 "memory.low" implements best-effort memory protection and is an
709 -----------
712 resource. Allocations can't be over-committed - the sum of the
719 As allocations can't be over-committed, some configuration
724 "cpu.rt.max" hard-allocates realtime slices and is an example of this
732 ------
737 New-line separated values
745 (when read-only or multiple values can be written at once)
771 -----------
773 - Settings for a single feature should be contained in a single file.
775 - The root cgroup should be exempt from resource control and thus
778 - The default time unit is microseconds. If a different unit is ever
781 - A parts-per quantity should use a percentage decimal with at least
782 two digit fractional part - e.g. 13.40.
784 - If a controller implements weight based resource distribution, its
790 - If a controller implements an absolute resource guarantee and/or
792 respectively. If a controller implements best effort resource
799 - If a setting has a configurable default value and keyed specific
813 # cat cgroup-example-interface-file
819 # echo 125 > cgroup-example-interface-file
823 # echo "default 125" > cgroup-example-interface-file
827 # echo "8:16 170" > cgroup-example-interface-file
831 # echo "8:0 default" > cgroup-example-interface-file
832 # cat cgroup-example-interface-file
836 - For events which are not very high frequency, an interface file
843 --------------------
848 A read-write single value file which exists on non-root
854 - "domain" : A normal valid domain cgroup.
856 - "domain threaded" : A threaded domain cgroup which is
859 - "domain invalid" : A cgroup which is in an invalid state.
863 - "threaded" : A threaded cgroup which is a member of a
870 A read-write new-line separated values file which exists on
874 the cgroup one-per-line. The PIDs are not ordered and the
883 - It must have write access to the "cgroup.procs" file.
885 - It must have write access to the "cgroup.procs" file of the
888 When delegating a sub-hierarchy, write access to this file
896 A read-write new-line separated values file which exists on
900 the cgroup one-per-line. The TIDs are not ordered and the
909 - It must have write access to the "cgroup.threads" file.
911 - The cgroup that the thread is currently in must be in the
914 - It must have write access to the "cgroup.procs" file of the
917 When delegating a sub-hierarchy, write access to this file
921 A read-only space separated values file which exists on all
928 A read-write space separated values file which exists on all
935 Space separated list of controllers prefixed with '+' or '-'
936 can be written to enable or disable controllers. A controller
937 name prefixed with '+' enables the controller and '-'
938 disables. If a controller appears more than once on the list,
943 A read-only flat-keyed file which exists on non-root cgroups.
955 A read-write single value file. The default is "max".
962 A read-write single value file. The default is "max".
969 A read-only flat-keyed file with the following entries:
987 Total number of live cgroup subsystems (e.g. memory
991 Total number of dying cgroup subsystems (e.g. memory
995 A read-only flat-keyed file which exists in non-root cgroups.
1013 A read-write single value file which exists on non-root cgroups.
1036 create new sub-cgroups.
1039 A write-only single value file which exists in non-root cgroups.
1051 the whole thread-group.
1054 A read-write single value file whose allowed values are "0" and "1".
1058 Writing "1" to the file will re-enable the cgroup PSI accounting.
1066 This may cause non-negligible overhead for some workloads when under
1068 be used to disable PSI accounting in the non-leaf cgroups.
1071 A read-write nested-keyed file.
1079 .. _cgroup-v2-cpu:
1082 ---
1085 controller implements weight and absolute bandwidth limit models for
1098 scheduling of realtime processes, the cpu controller can only be enabled
1103 to be moved to the root cgroup before the cpu controller can be enabled
1113 A read-only flat-keyed file.
1114 This file exists whether the controller is enabled or not.
1118 - usage_usec
1119 - user_usec
1120 - system_usec
1122 and the following five when the controller is enabled:
1124 - nr_periods
1125 - nr_throttled
1126 - throttled_usec
1127 - nr_bursts
1128 - burst_usec
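A flat-keyed file such as "cpu.stat" can be read into a dictionary with a few lines of code. The sketch below assumes one "KEY VALUE" pair per line with integer values. The sample text uses the key names listed above with made-up numbers, and `parse_flat_keyed` is a hypothetical helper.

```python
def parse_flat_keyed(text):
    """Parse "KEY VALUE" lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        stats[key] = int(value)
    return stats


# Made-up sample contents; real values come from reading cpu.stat.
sample = (
    "usage_usec 1500\n"
    "user_usec 1000\n"
    "system_usec 500\n"
    "nr_periods 10\n"
    "nr_throttled 2\n"
    "throttled_usec 300\n"
    "nr_bursts 0\n"
    "burst_usec 0\n"
)

cpu_stat = parse_flat_keyed(sample)
# Fraction of enforcement periods in which the cgroup was throttled.
throttled_ratio = cpu_stat["nr_throttled"] / cpu_stat["nr_periods"]
```

A ratio like this is one possible way to summarize how often the bandwidth limit was hit. It is meaningful only when "nr_periods" is non-zero.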
1131 A read-write single value file which exists on non-root
1141 A read-write single value file which exists on non-root
1144 The nice value is in the range [-20, 19].
1153 A read-write two value file which exists on non-root cgroups.
1165 A read-write single value file which exists on non-root
1171 A read-write nested-keyed file.
1177 A read-write single value file which exists on non-root cgroups.
1192 A read-write single value file which exists on non-root cgroups.
1203 A read-write single value file which exists on non-root cgroups.
1206 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1214 Memory
1215 ------
1217 The "memory" controller regulates distribution of memory. Memory is
1219 intertwining between memory usage and reclaim pressure and the
1220 stateful nature of memory, the distribution model is relatively
1223 While not completely water-tight, all major memory usages by a given
1224 cgroup are tracked so that the total memory consumption can be
1226 following types of memory usages are tracked.
1228 - Userland memory - page cache and anonymous memory.
1230 - Kernel data structures such as dentries and inodes.
1232 - TCP socket buffers.
1237 Memory Interface Files
1240 All memory amounts are in bytes. If a value which is not aligned to
1244 memory.current
1245 A read-only single value file which exists on non-root
1248 The total amount of memory currently being used by the cgroup
1251 memory.min
1252 A read-write single value file which exists on non-root
1255 Hard memory protection. If the memory usage of a cgroup
1256 is within its effective min boundary, the cgroup's memory
1258 unprotected reclaimable memory available, OOM killer
1264 Effective min boundary is limited by memory.min values of
1265 all ancestor cgroups. If there is memory.min overcommitment
1266 (child cgroup or cgroups are requiring more protected memory
1269 actual memory usage below memory.min.
1271 Putting more memory than generally available under this
1274 If a memory cgroup is not populated with processes,
1275 its memory.min is ignored.
1277 memory.low
1278 A read-write single value file which exists on non-root
1281 Best-effort memory protection. If the memory usage of a
1283 memory won't be reclaimed unless there is no reclaimable
1284 memory available in unprotected cgroups.
1290 Effective low boundary is limited by memory.low values of
1291 all ancestor cgroups. If there is memory.low overcommitment
1292 (child cgroup or cgroups are requiring more protected memory
1295 actual memory usage below memory.low.
1297 Putting more memory than generally available under this
1300 memory.high
1301 A read-write single value file which exists on non-root
1304 Memory usage throttle limit. If a cgroup's usage goes
1314 memory.max
1315 A read-write single value file which exists on non-root
1318 Memory usage hard limit. This is the main mechanism to limit
1319 memory usage of a cgroup. If a cgroup's memory usage reaches
1324 In default configuration regular 0-order allocations always
1329 as -ENOMEM or silently ignore in cases like disk readahead.
1331 memory.reclaim
1332 A write-only nested-keyed file which exists for all cgroups.
1334 This is a simple interface to trigger memory reclaim in the
1339 echo "1G" > memory.reclaim
1343 specified amount, -EAGAIN is returned.
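A program that triggers reclaim this way has to treat -EAGAIN as "less than the requested amount was reclaimed" rather than as a fatal error. The following is a minimal sketch under that assumption. `request_reclaim` is a hypothetical helper, and the cgroup directory passed to it is whatever path the cgroup has under the cgroup2 mount point.

```python
import errno
import os


def request_reclaim(cgroup_dir, amount):
    """Ask for `amount` (e.g. "1G") to be reclaimed from the cgroup.

    Returns True if the write succeeded and False if it failed with
    EAGAIN; any other error is propagated to the caller.
    """
    path = os.path.join(cgroup_dir, "memory.reclaim")
    try:
        # The data is flushed when the file is closed, so the close
        # must stay inside the try block to catch the error.
        with open(path, "w") as f:
            f.write(amount)
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            return False
        raise
    return True
```

A caller might retry with a smaller amount when the function returns False, or simply record that the target was not met.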
1346 interface) is not meant to indicate memory pressure on the
1347 memory cgroup. Therefore socket memory balancing triggered by
1348 the memory reclaim normally is not exercised in this case.
1350 reclaim induced by memory.reclaim.
1363 memory.peak
1364 A read-write single value file which exists on non-root cgroups.
1366 The max memory usage recorded for the cgroup and its descendants since
1369 A write of any non-empty string to this file resets it to the
1370 current memory usage for subsequent reads through the same
1373 memory.oom.group
1374 A read-write single value file which exists on non-root
1380 (if the memory cgroup is not a leaf cgroup) are killed
1384 Tasks with the OOM protection (oom_score_adj set to -1000)
1389 memory.oom.group values of ancestor cgroups.
1391 memory.events
1392 A read-only flat-keyed file which exists on non-root cgroups.
1400 memory.events.local.
1404 high memory pressure even though its usage is under
1406 boundary is over-committed.
1410 throttled and routed to perform direct memory reclaim
1411 because the high memory boundary was exceeded. For a
1412 cgroup whose memory usage is capped by the high limit
1413 rather than global memory pressure, this event's
1417 The number of times the cgroup's memory usage was
1422 The number of times the cgroup's memory usage was
1426 considered as an option, e.g. for failed high-order
1436 memory.events.local
1437 Similar to memory.events but the fields in the file are local
1441 memory.stat
1442 A read-only flat-keyed file which exists on non-root cgroups.
1444 This breaks down the cgroup's memory footprint into different
1445 types of memory, type-specific details, and other information
1446 on the state and past events of the memory management system.
1448 All memory amounts are in bytes.
1454 If the entry has no per-node counter (or is not shown in
1455 memory.numa_stat), we use 'npn' (non-per-node) as the tag
1456 to indicate that it will not show in memory.numa_stat.
1459 Amount of memory used in anonymous mappings such as
1463 Amount of memory used to cache filesystem data,
1464 including tmpfs and shared memory.
1467 Amount of total kernel memory, including
1469 addition to other kernel memory use cases.
1472 Amount of memory allocated to kernel stacks.
1475 Amount of memory allocated for page tables.
1478 Amount of memory allocated for secondary page tables,
1483 Amount of memory used for storing per-cpu kernel
1487 Amount of memory used in network transmission buffers
1490 Amount of memory used for vmap backed memory.
1493 Amount of cached filesystem data that is swap-backed,
1497 Amount of memory consumed by the zswap compression backend.
1500 Amount of application memory swapped out to zswap.
1514 Amount of swap cached in memory. The swapcache is accounted
1515 against both memory and swap usage.
1518 Amount of memory used in anonymous mappings backed by
1530 Amount of memory, swap-backed and filesystem-backed,
1531 on the internal memory management lists used by the
1535 memory management lists), inactive_foo + active_foo may not be equal to
1536 the value for the foo counter, since the foo counter is type-based, not
1537 list-based.
1544 Part of "slab" that cannot be reclaimed on memory
1548 Amount of memory used for storing in-kernel data
1615 Amount of pages postponed to be freed under memory pressure
1621 Number of pages swapped into memory and filled with zero, where I/O
1626 Number of zero-filled pages swapped out with I/O skipped due to the
1630 Number of pages moved in to memory from zswap.
1633 Number of pages moved out of memory to zswap.
1676 memory.numa_stat
1677 A read-only nested-keyed file which exists on non-root cgroups.
1679 This breaks down the cgroup's memory footprint into different
1680 types of memory, type-specific details, and other information
1681 per node on the state of the memory management system.
1689 All memory amounts are in bytes.
1691 The output format of memory.numa_stat is::
1699 For the meaning of the entries, refer to memory.stat.
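A nested-keyed file such as "memory.numa_stat" maps each entry to a set of per-node values. The sketch below assumes each line has the form "type N0=<bytes> N1=<bytes> ...". The sample text is made up, and `parse_numa_stat` is a hypothetical helper.

```python
def parse_numa_stat(text):
    """Parse "type N0=<bytes> N1=<bytes> ..." lines into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        entry, pairs = fields[0], fields[1:]
        result[entry] = {
            node: int(value)
            for node, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


# Made-up sample contents for a two-node machine.
sample = "anon N0=4096 N1=8192\nfile N0=0 N1=12288\n"

numa = parse_numa_stat(sample)
# Summing over the nodes gives the total for one entry.
anon_total = sum(numa["anon"].values())
```

For an entry that has a per-node counter, the per-node sum would be expected to correspond to the entry of the same name in "memory.stat".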
1701 memory.swap.current
1702 A read-only single value file which exists on non-root
1708 memory.swap.high
1709 A read-write single value file which exists on non-root
1714 allow userspace to implement custom out-of-memory procedures.
1718 during regular operation. Compare to memory.swap.max, which
1720 continue unimpeded as long as other memory can be reclaimed.
1724 memory.swap.peak
1725 A read-write single value file which exists on non-root cgroups.
1730 A write of any non-empty string to this file resets it to the
1731 current memory usage for subsequent reads through the same
1734 memory.swap.max
1735 A read-write single value file which exists on non-root
1739 limit, anonymous memory of the cgroup will not be swapped out.
1741 memory.swap.events
1742 A read-only flat-keyed file which exists on non-root cgroups.
1758 because of running out of swap system-wide or max
1764 reduces the impact on the workload and memory management.
1766 memory.zswap.current
1767 A read-only single value file which exists on non-root
1770 The total amount of memory consumed by the zswap compression
1773 memory.zswap.max
1774 A read-write single value file which exists on non-root
1781 memory.zswap.writeback
1782 A read-write single value file. The default value is "1".
1794 Note that this is subtly different from setting memory.swap.max to
1797 is allowed unless memory.swap.max is set to 0.
1799 memory.pressure
1800 A read-only nested-keyed file.
1802 Shows pressure stall information for memory. See
1809 "memory.high" is the main mechanism to control memory usage.
1810 Over-committing on high limit (sum of high limits > available memory)
1811 and letting global memory pressure distribute memory according to
1817 more memory or terminating the workload.
1819 Determining whether a cgroup has enough memory is not trivial as
1820 memory usage doesn't indicate whether the workload can benefit from
1821 more memory. For example, a workload which writes data received from
1822 network to a file can use all available memory but can also perform
1823 just as well with a small amount of memory. A measure of memory
1824 pressure - how much the workload is being impacted due to lack of
1825 memory - is necessary to determine whether a workload needs more
1826 memory; unfortunately, memory pressure monitoring mechanism isn't
1830 Memory Ownership
1833 A memory area is charged to the cgroup which instantiated it and stays
1835 to a different cgroup doesn't move the memory usages that it
1838 A memory area may be used by processes belonging to different cgroups.
1839 To which cgroup the area will be charged is indeterminate; however,
1840 over time, the memory area is likely to end up in a cgroup which has
1841 enough memory allowance to avoid high reclaim pressure.
1843 If a cgroup sweeps a considerable amount of memory which is expected
1845 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
1846 belonging to the affected files to ensure correct memory ownership.
1850 --
1852 The "io" controller regulates the distribution of IO resources. This
1853 controller implements both weight based and absolute bandwidth or IOPS
1855 only if cfq-iosched is in use and neither scheme is available for
1856 blk-mq devices.
1863 A read-only nested-keyed file.
1883 A read-write nested-keyed file which exists only on the root
1887 model based controller (CONFIG_BLK_CGROUP_IOCOST) which
1895 enable Weight-based control enable
1905 The controller is disabled by default and can be enabled by
1907 to zero and the controller uses internal device saturation
1915 shows that on sdb, the controller is enabled, will consider
1927 devices which show wide temporary behavior changes - e.g. a
1938 A read-write nested-keyed file which exists only on the root
1942 controller (CONFIG_BLK_CGROUP_IOCOST) which currently
1951 model The cost model in use - "linear"
1977 generate device-specific coefficients.
1980 A read-write flat-keyed file which exists on non-root cgroups.
2000 A read-write nested-keyed file which exists on non-root
2014 When writing, any number of nested key-value pairs can be
2039 A read-only nested-keyed file.
2050 mechanism. Writeback sits between the memory and IO domains and
2051 regulates the proportion of dirty memory by balancing dirtying and
2054 The io controller, in conjunction with the memory controller,
2055 implements control of page cache writeback IOs. The memory controller
2056 defines the memory domain that dirty memory ratio is calculated and
2057 maintained for and the io controller defines the io domain which
2058 writes out dirty pages for the memory domain. Both system-wide and
2059 per-cgroup dirty memory states are examined and the more restrictive
2067 There are inherent differences in memory and writeback management
2068 which affect how cgroup ownership is tracked. Memory is tracked per
2073 As cgroup ownership for memory is tracked per page, there can be pages
2085 As the memory controller assigns page ownership on the first use and
2096 amount of available memory capped by limits imposed by the
2097 memory controller and system-wide clean memory.
2101 total available memory and applied the same way as
2108 This is a cgroup v2 controller for IO workload protection. You provide a group
2110 controller will throttle any peers that have a lower latency target than the
2130 your real setting, setting at 10-15% higher than the value in io.stat.
2136 target the controller doesn't do anything. Once a group starts missing its
2140 - Queue depth throttling. This is the number of outstanding IOs a group is
2144 - Artificial delay induction. There are certain types of IO that cannot be
2167 If the controller is enabled you will see extra stats in io.stat in
2191 no-change
2194 promote-to-rt
2195 For requests that have a non-RT I/O priority class, change it into RT.
2199 restrict-to-be
2209 none-to-rt
2210 Deprecated. Just an alias for promote-to-rt.
2214 +----------------+---+
2215 | no-change | 0 |
2216 +----------------+---+
2217 | promote-to-rt | 1 |
2218 +----------------+---+
2219 | restrict-to-be | 2 |
2220 +----------------+---+
2222 +----------------+---+
2226 +-------------------------------+---+
2228 +-------------------------------+---+
2229 | IOPRIO_CLASS_RT (real-time) | 1 |
2230 +-------------------------------+---+
2232 +-------------------------------+---+
2234 +-------------------------------+---+
2238 - If I/O priority class policy is promote-to-rt, change the request I/O
2241 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
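The translation described by the two rules above can be sketched as a small function. Several things here are assumptions rather than content of the tables as shown: the "idle" policy with number 3, the class numbers for IOPRIO_CLASS_NONE (0), IOPRIO_CLASS_BE (2) and IOPRIO_CLASS_IDLE (3), and the rule that a policy other than promote-to-rt yields the maximum of the policy number and the request's class number. `effective_class` is a hypothetical helper.

```python
# Policy name -> number; the "idle" row is an assumption.
POLICY_NUMBER = {
    "no-change": 0,
    "promote-to-rt": 1,
    "restrict-to-be": 2,
    "idle": 3,
}

# I/O priority class numbers; only the RT row appears above,
# the other three are assumed.
IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1
IOPRIO_CLASS_BE = 2
IOPRIO_CLASS_IDLE = 3


def effective_class(policy, request_class):
    """Return the I/O priority class a request ends up with."""
    if policy == "none-to-rt":
        policy = "promote-to-rt"  # deprecated alias
    if policy == "promote-to-rt":
        return IOPRIO_CLASS_RT
    # Assumed rule: a larger number means a lower priority class,
    # so taking the maximum can only restrict a request.
    return max(POLICY_NUMBER[policy], request_class)
```

Under these assumptions, "restrict-to-be" demotes a real-time request to best-effort, while "no-change" leaves every request in its own class.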
2247 ---
2249 The process number controller is used to allow a cgroup to stop any
2254 controllers cannot prevent, thus warranting its own controller. For
2256 hitting memory restrictions.
2258 Note that PIDs used in this controller refer to TIDs, process IDs as
2266 A read-write single value file which exists on non-root
2272 A read-only single value file which exists on non-root cgroups.
2278 A read-only single value file which exists on non-root cgroups.
2284 A read-only flat-keyed file which exists on non-root cgroups. Unless
2302 through fork() or clone(). These will return -EAGAIN if the creation
2307 ------
2309 The "cpuset" controller provides a mechanism for constraining
2310 the CPU and memory node placement of tasks to only the resources
2314 memory placement to reduce cross-node memory access and contention
2317 The "cpuset" controller is hierarchical. That means the controller
2318 cannot use CPUs or memory nodes not allowed in its parent.
2325 A read-write multiple values file which exists on non-root
2326 cpuset-enabled cgroups.
2333 The CPU numbers are comma-separated numbers or ranges.
2337 0-4,6,8-10
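A list in this format can be expanded into individual CPU numbers as follows. This is a minimal sketch that accepts only plain numbers and inclusive "start-end" ranges, and `parse_cpu_list` is a hypothetical helper. The same routine applies to the memory node lists used by "cpuset.mems".

```python
def parse_cpu_list(text):
    """Expand a list such as "0-4,6,8-10" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file yields an empty list
        if "-" in part:
            start, end = part.split("-", 1)
            # Ranges are inclusive on both ends.
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


cpus = parse_cpu_list("0-4,6,8-10")
```

With the example value, `cpus` holds nine CPU numbers: 0 through 4, then 6, 8, 9 and 10.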
2340 setting as the nearest cgroup ancestor with a non-empty
2347 A read-only multiple values file which exists on all
2348 cpuset-enabled cgroups.
2364 A read-write multiple values file which exists on non-root
2365 cpuset-enabled cgroups.
2367 It lists the requested memory nodes to be used by tasks within
2368 this cgroup. The actual list of memory nodes granted, however,
2370 from the requested memory nodes.
2372 The memory node numbers are comma-separated numbers or ranges.
2376 0-1,3
2379 setting as the nearest cgroup ancestor with a non-empty
2380 "cpuset.mems" or all the available memory nodes if none
2384 and won't be affected by any memory nodes hotplug events.
2386 Setting a non-empty value to "cpuset.mems" causes memory of
2388 they are currently using memory outside of the designated nodes.
2390 There is a cost for this memory migration. The migration
2391 may not be complete and some memory pages may be left behind.
2398 A read-only multiple values file which exists on all
2399 cpuset-enabled cgroups.
2401 It lists the onlined memory nodes that are actually granted to
2402 this cgroup by its parent. These memory nodes are allowed to
2405 If "cpuset.mems" is empty, it shows all the memory nodes from the
2408 the memory nodes listed in "cpuset.mems" can be granted. In this
2411 Its value will be affected by memory nodes hotplug events.
2414 A read-write multiple values file which exists on non-root
2415 cpuset-enabled cgroups.
2448 A read-only multiple values file which exists on all non-root
2449 cpuset-enabled cgroups.
2461 A read-only and root cgroup only multiple values file.
2468 A read-write single value file which exists on non-root
2469 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2475 "member" Non-root member of a partition
2480 A cpuset partition is a collection of cpuset-enabled cgroups with
2487 There are two types of partitions - local and remote. A local
2503 be changed. All other non-root cgroups start out as "member".
2516 two possible states - valid or invalid. An invalid partition
2527 "member" Non-root member of a partition
2554 A valid non-root parent partition may distribute out all its CPUs
2573 A user can pre-configure certain CPUs to an isolated state
2579 Device controller
2580 -----------------
2582 Device controller manages access to device files. It includes both
2586 Cgroup v2 device controller has no interface files and is implemented
2591 on the return value the attempt will succeed or fail with -EPERM.
2596 If the program returns 0, the attempt fails with -EPERM, otherwise it
2604 ----
2606 The "rdma" controller regulates the distribution and accounting of
2613 A read-write nested-keyed file that exists for all the cgroups
2634 A read-only file that describes current resource usage.
2643 -------
2645 The HugeTLB controller allows limiting HugeTLB usage per control group and
2646 enforces the controller limit during page fault.
2660 A read-only flat-keyed file which exists on non-root cgroups.
2671 Similar to memory.numa_stat, it shows the numa information of the
2673 use hugetlb pages are included. The per-node values are in bytes.
2676 ----
2680 cgroup resources. Controller is enabled by the CONFIG_CGROUP_MISC config
2683 A resource can be added to the controller via enum misc_res_type{} in the
2689 uncharge APIs. All of the APIs to interact with misc controller are in
2695 Miscellaneous controller provides 3 interface files. If two misc resources (res_a and res_b) are re…
2698 A read-only flat-keyed file shown only in the root cgroup. It shows
2707 A read-only flat-keyed file shown in all cgroups. It shows
2715 A read-only flat-keyed file shown in all cgroups. It shows the
2724 A read-write flat-keyed file shown in non-root cgroups. Allowed
2743 A read-only flat-keyed file which exists on non-root cgroups. The
2766 ------
2771 perf_event controller, if not mounted on a legacy hierarchy, is
2773 always be filtered by cgroup v2 path. The controller can still be
2777 Non-normative information
2778 -------------------------
2784 CPU controller root cgroup process behaviour
2794 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2797 IO controller root cgroup process behaviour
2810 ------
2829 The path '/batchjobs/container_id1' can be considered as system-data
2834 # ls -l /proc/self/ns/cgroup
2835 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2841 # ls -l /proc/self/ns/cgroup
2842 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2846 When some thread from a multi-threaded process unshares its cgroup
2858 ------------------
2869 # ~/unshare -c # unshare cgroupns in some cgroup
2877 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2908 ----------------------
2937 ---------------------------------
2940 running inside a non-init cgroup namespace::
2942 # mount -t cgroup2 none $MOUNT_POINT
2949 the view of cgroup hierarchy by namespace-private cgroupfs mount
2962 --------------------------------
2965 address_space_operations->writepage[s]() to annotate bio's using the
2982 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
2999 - Multiple hierarchies, including named ones, are not supported.
3001 - None of the v1 mount options are supported.
3003 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3005 - "cgroup.clone_children" is removed.
3007 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
3015 --------------------
3021 For example, as there is only one instance of each controller, utility
3028 the specific controller.
3032 each controller on its own hierarchy. Only closely related ones, such
3051 Also, as a controller couldn't have any expectation regarding the
3053 controller had to assume that all other controllers were attached to
3060 depending on the specific controller. In other words, hierarchy may
3063 how memory is distributed beyond a certain level while still wanting
3068 ------------------
3076 Generally, in-process knowledge is available only to the process
3077 itself; thus, unlike service-level organization of processes,
3084 sub-hierarchies and control resource distributions along them. This
3085 effectively raised cgroup to the status of a syscall-like API exposed
3095 that the process would actually be operating on its own sub-hierarchy.
3099 system-management pseudo filesystem. cgroup ended up with interface
3102 individual applications through the ill-defined delegation mechanism
3112 -------------------------------------------
3120 The cpu controller considered threads and cgroups as equivalents and
3123 cycles and the number of internal threads fluctuated - the ratios
3129 The io controller implicitly created a hidden leaf node for each
3137 The memory controller didn't have a way to control what happened
3139 clearly defined. There were attempts to add ad-hoc behaviors and
3153 ----------------------
3157 was how an empty cgroup was notified - a userland helper binary was
3160 to an in-kernel event delivery filtering mechanism, further complicating
3163 Controller interfaces were problematic too. An extreme example is
3175 formats and units even in the same controller.
3181 Controller Issues and Remedies
3182 ------------------------------
3184 Memory subsection
3189 global reclaim prefers is opt-in, rather than opt-out. The costs for
3199 becomes self-defeating.
3201 The memory.low boundary on the other hand is a top-down allocated
3210 available memory. The memory consumption of workloads varies during
3218 The memory.high boundary on the other hand can be set much more
3224 and make corrections until the minimal memory footprint that still
3231 system than killing the group. Otherwise, memory.max is there to
3235 Setting the original memory.limit_in_bytes below the current usage was
3237 limit setting to fail. memory.max on the other hand will first set the
3239 new limit is met - or the task writing to memory.max is killed.
3241 The combined memory+swap accounting and limiting is replaced by real
3244 The main argument for a combined memory+swap facility in the original
3246 able to swap all anonymous memory of a child group, regardless of the
3248 groups can sabotage swapping by other means - such as referencing its
3249 anonymous memory in a tight loop - and an admin can not assume full