Lines Matching +full:memory +full:- +full:controller
9 conventions of cgroup v2. It describes all userland-visible aspects
10 of cgroup including core and specific controller behaviors. All
12 v1 is available under Documentation/cgroup-v1/.
17 1-1. Terminology
18 1-2. What is cgroup?
20 2-1. Mounting
21 2-2. Organizing Processes and Threads
22 2-2-1. Processes
23 2-2-2. Threads
24 2-3. [Un]populated Notification
25 2-4. Controlling Controllers
26 2-4-1. Enabling and Disabling
27 2-4-2. Top-down Constraint
28 2-4-3. No Internal Process Constraint
29 2-5. Delegation
30 2-5-1. Model of Delegation
31 2-5-2. Delegation Containment
32 2-6. Guidelines
33 2-6-1. Organize Once and Control
34 2-6-2. Avoid Name Collisions
36 3-1. Weights
37 3-2. Limits
38 3-3. Protections
39 3-4. Allocations
41 4-1. Format
42 4-2. Conventions
43 4-3. Core Interface Files
45 5-1. CPU
46 5-1-1. CPU Interface Files
47 5-2. Memory
48 5-2-1. Memory Interface Files
49 5-2-2. Usage Guidelines
50 5-2-3. Memory Ownership
51 5-3. IO
52 5-3-1. IO Interface Files
53 5-3-2. Writeback
54 5-3-3. IO Latency
55 5-3-3-1. How IO Latency Throttling Works
56 5-3-3-2. IO Latency Interface Files
57 5-4. PID
58 5-4-1. PID Interface Files
59 5-5. Device
60 5-6. RDMA
61 5-6-1. RDMA Interface Files
62 5-7. Misc
63 5-7-1. perf_event
64 5-N. Non-normative information
65 5-N-1. CPU controller root cgroup process behaviour
66 5-N-2. IO controller root cgroup process behaviour
68 6-1. Basics
69 6-2. The Root and Views
70 6-3. Migration and setns(2)
71 6-4. Interaction with Other Namespaces
73 P-1. Filesystem Support for Writeback
76 R-1. Multiple Hierarchies
77 R-2. Thread Granularity
78 R-3. Competition Between Inner Nodes and Threads
79 R-4. Other Interface Issues
80 R-5. Controller Issues and Remedies
81 R-5-1. Memory
88 -----------
97 ---------------
103 cgroup is largely composed of two parts - the core and controllers.
105 processes. A cgroup controller is usually responsible for
118 disabled selectively on a cgroup. All controller behaviors are
119 hierarchical - if a controller is enabled on a cgroup, it affects all
121 sub-hierarchy of the cgroup. When a controller is enabled on a nested
131 --------
136 # mount -t cgroup2 none $MOUNT_POINT
145 A controller can be moved across hierarchies only after the controller
146 is no longer referenced in its current hierarchy. Because per-cgroup
147 controller states are destroyed asynchronously and controllers may
148 have lingering references, a controller may not show up immediately on
150 Similarly, a controller should be fully disabled to be moved out of
152 controller to become available for other hierarchies; furthermore, due
153 to inter-controller dependencies, other controllers may need to be
159 the hierarchies and controller associations before starting to use the
175 ignored on non-init namespace mounts. Please refer to the
180 --------------------------------
186 A child cgroup can be created by creating a sub-directory::
191 structure. Each cgroup has a read-writable interface file
193 belong to the cgroup one-per-line. The PIDs are not ordered and the
224 0::/test-cgroup/test-cgroup-nested
231 0::/test-cgroup/test-cgroup-nested (deleted)
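The two sample entries above can be read back with a small shell
helper. This is a minimal sketch, not part of the kernel interface:
the function names are hypothetical, and it assumes the v2 entry is
the line beginning with "0::" and that a removed cgroup is marked by
a trailing " (deleted)".

```shell
# Print the cgroup v2 path recorded in a /proc/<pid>/cgroup style file.
# $1: file to read (assumption: defaults to /proc/self/cgroup).
cgroup2_path() {
    file=${1:-/proc/self/cgroup}
    # Keep only the v2 entry and strip its "0::" prefix.
    sed -n 's/^0:://p' "$file" | head -n 1
}

# Succeed (exit status 0) if the recorded v2 path is marked as deleted.
cgroup2_is_deleted() {
    case "$(cgroup2_path "$1")" in
        *" (deleted)") return 0 ;;
        *) return 1 ;;
    esac
}
```

For the first sample entry, cgroup2_path prints
"/test-cgroup/test-cgroup-nested"; for the second,
cgroup2_is_deleted succeeds.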
257 constraint - threaded controllers can be enabled on non-leaf cgroups
281 - As the cgroup will join the parent's resource domain. The parent
284 - When the parent is an unthreaded domain, it must not have any domain
288 Topology-wise, a cgroup can be in an invalid state. Please consider
291 A (threaded domain) - B (threaded) - C (domain, just created)
306 threads in the cgroup. Except that the operations are per-thread
307 instead of per-process, "cgroup.threads" has the same format and
322 a threaded controller is enabled inside a threaded subtree, it only
328 constraint, a threaded controller must be able to handle competition
329 between threads in a non-leaf cgroup and its child cgroups. Each
330 threaded controller defines how such competitions are handled.
334 --------------------------
336 Each non-root cgroup has a "cgroup.events" file which contains a
337 "populated" field indicating whether the cgroup's sub-hierarchy has
341 example, to start a clean-up operation after all processes of a given
342 sub-hierarchy have exited. The populated state updates and
343 notifications are recursive. Consider the following sub-hierarchy
347 A(4) - B(0) - C(1)
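The clean-up use case mentioned above might be sketched in shell as
follows. The function names are hypothetical, the "KEY VALUE" layout
of "cgroup.events" is assumed, and the loop polls once per second
purely for simplicity rather than reacting to notifications.

```shell
# Print the "populated" field (0 or 1) of the cgroup in directory $1.
cgroup_populated() {
    awk '$1 == "populated" { print $2 }' "$1/cgroup.events"
}

# Block until no live process remains in the sub-hierarchy of the
# cgroup in directory $1. Polling is an illustrative simplification.
wait_until_empty() {
    while [ "$(cgroup_populated "$1")" = "1" ]; do
        sleep 1
    done
}
```

With the hierarchy shown above, "wait_until_empty $MOUNT_POINT/A"
would return only after the processes in A and C have all exited,
because the populated state is recursive.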
357 -----------------------
366 cpu io memory
368 No controller is enabled by default. Controllers can be enabled and
371 # echo "+cpu +memory -io" > cgroup.subtree_control
375 all succeed or fail. If multiple operations on the same controller
378 Enabling a controller in a cgroup indicates that the distribution of
380 Consider the following sub-hierarchy. The enabled controllers are
383 A(cpu,memory) - B(memory) - C()
386 As A has "cpu" and "memory" enabled, A will control the distribution
387 of CPU cycles and memory to its children, in this case, B. As B has
388 "memory" enabled but not "cpu", C and D will compete freely on CPU
389 cycles but their division of memory available to B will be controlled.
391 As a controller regulates the distribution of the target resource to
392 the cgroup's children, enabling it creates the controller's interface
394 would create the "cpu." prefixed controller interface files in C and
395 D. Likewise, disabling "memory" from B would remove the "memory."
396 prefixed controller interface files from C and D. This means that the
397 controller interface files - anything which doesn't start with
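As an illustration of the enable/disable step, the shell sketch
below checks "cgroup.controllers" and then issues all operations in
a single write to "cgroup.subtree_control", so that they succeed or
fail together. The function names are hypothetical and error
handling is kept to a minimum.

```shell
# Succeed if controller $2 is listed in "cgroup.controllers" of the
# cgroup in directory $1 (a space separated list such as "cpu io memory").
controller_available() {
    tr ' ' '\n' < "$1/cgroup.controllers" | grep -qx "$2"
}

# Write tokens such as +cpu or -io to "cgroup.subtree_control".
# $1: cgroup directory; remaining arguments: the tokens.
subtree_control() {
    cg=$1
    shift
    for tok in "$@"; do
        case "$tok" in
            [+-]?*) ;;                                  # looks like +name or -name
            *) echo "bad token: $tok" >&2; return 1 ;;
        esac
    done
    # "$*" joins the tokens with single spaces; one write carries them all.
    printf '%s\n' "$*" > "$cg/cgroup.subtree_control"
}
```

For example, "subtree_control $MOUNT_POINT/A +cpu +memory" would
correspond to the configuration of A in the hierarchy above.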
401 Top-down Constraint
404 Resources are distributed top-down and a cgroup can further distribute
406 parent. This means that all non-root "cgroup.subtree_control" files
408 "cgroup.subtree_control" file. A controller can be enabled only if
409 the parent has the controller enabled and a controller can't be
416 Non-root cgroups can distribute domain resources to their children
421 This guarantees that, when a domain controller is looking at the part
430 is up to each controller (for more information on this topic please
431 refer to the Non-normative information section in the Controllers
435 enabled controller in the cgroup's "cgroup.subtree_control". This is
444 ----------
464 delegated, the user can build a sub-hierarchy under the directory,
468 happens in the delegated sub-hierarchy, nothing can escape the
472 cgroups in or nesting depth of a delegated sub-hierarchy; however,
479 A delegated sub-hierarchy is contained in the sense that processes
480 can't be moved into or out of the sub-hierarchy by the delegatee.
483 requiring the following conditions for a process with a non-root euid
487 - The writer must have write access to the "cgroup.procs" file.
489 - The writer must have write access to the "cgroup.procs" file of the
493 processes around freely in the delegated sub-hierarchy it can't pull
494 in from or push out to outside the sub-hierarchy.
500 ~~~~~~~~~~~~~ - C0 - C00
503 ~~~~~~~~~~~~~ - C1 - C10
510 will be denied with -EACCES.
515 is not reachable, the migration is rejected with -ENOENT.
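The write-access conditions above can be approximated from userland.
The following shell sketch computes the common ancestor of two
cgroup paths and tests write access to the relevant "cgroup.procs"
files. The function names are hypothetical, paths are assumed to be
given relative to the mount point, and the namespace reachability
check is not modelled; the kernel remains the authority on whether a
migration is allowed.

```shell
# Print the deepest common ancestor of two cgroup paths, both given
# relative to the cgroup2 mount point (e.g. /C0/C00 and /C1/C10).
common_ancestor() {
    a=${1%/}/
    b=${2%/}/
    anc=
    while [ -n "$a" ] && [ -n "$b" ]; do
        ca=${a%%/*}                 # first remaining component of each path
        cb=${b%%/*}
        [ "$ca" = "$cb" ] || break
        anc=$anc$ca/
        a=${a#*/}
        b=${b#*/}
    done
    anc=${anc%/}
    printf '%s\n' "${anc:-/}"
}

# Check write access for moving a process from cgroup $2 to cgroup $3
# under mount point $1: the destination's and the common ancestor's
# "cgroup.procs" must both be writable by the caller.
can_migrate() {
    root=${1%/}
    anc=$(common_ancestor "$2" "$3")
    [ -w "$root$3/cgroup.procs" ] && [ -w "$root${anc%/}/cgroup.procs" ]
}
```

For the C10 to C00 example, common_ancestor prints the parent of C0
and C1, which lies above the points of delegation, so can_migrate is
expected to fail for the delegatee.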
519 ----------
525 and stateful resources such as memory are not moved together with the
527 inherent trade-offs between migration and various hot paths in terms
533 resource structure once on start-up. Dynamic adjustments to resource
534 distribution can be made by changing controller configuration through
546 controller's interface files are prefixed with the controller name and
547 a dot. A controller's name is composed of lowercase letters and
566 -------
572 work-conserving. Due to the dynamic nature, this model is usually
588 ------
591 Limits can be over-committed - the sum of the limits of children can
596 As limits can be over-committed, all configuration combinations are
605 -----------
610 soft boundaries. Protections can also be over-committed in which case
617 As protections can be over-committed, all configuration combinations
621 "memory.low" implements best-effort memory protection and is an
626 -----------
629 resource. Allocations can't be over-committed - the sum of the
636 As allocations can't be over-committed, some configuration
641 "cpu.rt.max" hard-allocates realtime slices and is an example of this
649 ------
654 New-line separated values
662 (when read-only or multiple values can be written at once)
688 -----------
690 - Settings for a single feature should be contained in a single file.
692 - The root cgroup should be exempt from resource control and thus
697 - If a controller implements weight based resource distribution, its
703 - If a controller implements an absolute resource guarantee and/or
705 respectively. If a controller implements best effort resource
712 - If a setting has a configurable default value and keyed specific
726 # cat cgroup-example-interface-file
732 # echo 125 > cgroup-example-interface-file
736 # echo "default 125" > cgroup-example-interface-file
740 # echo "8:16 170" > cgroup-example-interface-file
744 # echo "8:0 default" > cgroup-example-interface-file
745 # cat cgroup-example-interface-file
749 - For events which are not very high frequency, an interface file
756 --------------------
762 A read-write single value file which exists on non-root
768 - "domain" : A normal valid domain cgroup.
770 - "domain threaded" : A threaded domain cgroup which is
773 - "domain invalid" : A cgroup which is in an invalid state.
777 - "threaded" : A threaded cgroup which is a member of a
784 A read-write new-line separated values file which exists on
788 the cgroup one-per-line. The PIDs are not ordered and the
797 - It must have write access to the "cgroup.procs" file.
799 - It must have write access to the "cgroup.procs" file of the
802 When delegating a sub-hierarchy, write access to this file
810 A read-write new-line separated values file which exists on
814 the cgroup one-per-line. The TIDs are not ordered and the
823 - It must have write access to the "cgroup.threads" file.
825 - The cgroup that the thread is currently in must be in the
828 - It must have write access to the "cgroup.procs" file of the
831 When delegating a sub-hierarchy, write access to this file
835 A read-only space separated values file which exists on all
842 A read-write space separated values file which exists on all
849 Space separated list of controllers prefixed with '+' or '-'
850 can be written to enable or disable controllers. A controller
851 name prefixed with '+' enables the controller and '-'
852 disables. If a controller appears more than once on the list,
857 A read-only flat-keyed file which exists on non-root cgroups.
867 A read-write single value file. The default is "max".
874 A read-write single value file. The default is "max".
881 A read-only flat-keyed file with the following entries:
903 ---
906 controller implements weight and absolute bandwidth limit models for
911 the cpu controller can only be enabled when all RT processes are in
915 before the cpu controller can be enabled.
924 A read-only flat-keyed file which exists on non-root cgroups.
925 This file exists whether the controller is enabled or not.
929 - usage_usec
930 - user_usec
931 - system_usec
933 and the following three when the controller is enabled:
935 - nr_periods
936 - nr_throttled
937 - throttled_usec
940 A read-write single value file which exists on non-root
946 A read-write single value file which exists on non-root
949 The nice value is in the range [-20, 19].
958 A read-write two value file which exists on non-root cgroups.
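The "cpu.stat" fields listed above lend themselves to simple
derived metrics. The shell sketch below reports how often a cgroup
was throttled, assuming that nr_periods counts elapsed enforcement
periods and nr_throttled counts the periods in which the cgroup was
throttled. The function name is hypothetical.

```shell
# Print the percentage of periods in which the cgroup in directory $1
# was throttled. Prints nothing if the fields are absent (controller
# not enabled) or no period has elapsed yet.
cpu_throttled_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END { if (periods > 0) printf "%d\n", 100 * throttled / periods }
    ' "$1/cpu.stat"
}
```

A persistently high percentage suggests the bandwidth limit is the
bottleneck for the workload.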
970 Memory
971 ------
973 The "memory" controller regulates distribution of memory. Memory is
975 intertwining between memory usage and reclaim pressure and the
976 stateful nature of memory, the distribution model is relatively
979 While not completely water-tight, all major memory usages by a given
980 cgroup are tracked so that the total memory consumption can be
982 following types of memory usages are tracked.
984 - Userland memory - page cache and anonymous memory.
986 - Kernel data structures such as dentries and inodes.
988 - TCP socket buffers.
993 Memory Interface Files
996 All memory amounts are in bytes. If a value which is not aligned to
1000 memory.current
1001 A read-only single value file which exists on non-root
1004 The total amount of memory currently being used by the cgroup
1007 memory.min
1008 A read-write single value file which exists on non-root
1011 Hard memory protection. If the memory usage of a cgroup
1012 is within its effective min boundary, the cgroup's memory
1014 unprotected reclaimable memory available, OOM killer
1017 Effective min boundary is limited by memory.min values of
1018 all ancestor cgroups. If there is memory.min overcommitment
1019 (child cgroup or cgroups are requiring more protected memory
1022 actual memory usage below memory.min.
1024 Putting more memory than generally available under this
1027 If a memory cgroup is not populated with processes,
1028 its memory.min is ignored.
1030 memory.low
1031 A read-write single value file which exists on non-root
1034 Best-effort memory protection. If the memory usage of a
1036 memory won't be reclaimed unless memory can be reclaimed
1039 Effective low boundary is limited by memory.low values of
1040 all ancestor cgroups. If there is memory.low overcommitment
1041 (child cgroup or cgroups are requiring more protected memory
1044 actual memory usage below memory.low.
1046 Putting more memory than generally available under this
1049 memory.high
1050 A read-write single value file which exists on non-root
1053 Memory usage throttle limit. This is the main mechanism to
1054 control memory usage of a cgroup. If a cgroup's usage goes
1061 memory.max
1062 A read-write single value file which exists on non-root
1065 Memory usage hard limit. This is the final protection
1066 mechanism. If a cgroup's memory usage reaches this limit and
1075 memory.oom.group
1076 A read-write single value file which exists on non-root
1082 (if the memory cgroup is not a leaf cgroup) are killed
1086 Tasks with the OOM protection (oom_score_adj set to -1000)
1091 memory.oom.group values of ancestor cgroups.
1093 memory.events
1094 A read-only flat-keyed file which exists on non-root cgroups.
1101 high memory pressure even though its usage is under
1103 boundary is over-committed.
1107 throttled and routed to perform direct memory reclaim
1108 because the high memory boundary was exceeded. For a
1109 cgroup whose memory usage is capped by the high limit
1110 rather than global memory pressure, this event's
1114 The number of times the cgroup's memory usage was
1119 The number of times the cgroup's memory usage was
1126 userspace as -ENOMEM or silently ignored in cases like
1127 disk readahead. For now OOM in memory cgroup kills
1134 memory.stat
1135 A read-only flat-keyed file which exists on non-root cgroups.
1137 This breaks down the cgroup's memory footprint into different
1138 types of memory, type-specific details, and other information
1139 on the state and past events of the memory management system.
1141 All memory amounts are in bytes.
1148 Amount of memory used in anonymous mappings such as
1152 Amount of memory used to cache filesystem data,
1153 including tmpfs and shared memory.
1156 Amount of memory allocated to kernel stacks.
1159 Amount of memory used for storing in-kernel data
1163 Amount of memory used in network transmission buffers
1166 Amount of cached filesystem data that is swap-backed,
1181 Amount of memory, swap-backed and filesystem-backed,
1182 on the internal memory management lists used by the
1190 Part of "slab" that cannot be reclaimed on memory
1233 Amount of pages postponed to be freed under memory pressure
1239 memory.swap.current
1240 A read-only single value file which exists on non-root
1246 memory.swap.max
1247 A read-write single value file which exists on non-root
1251 limit, anonymous memory of the cgroup will not be swapped out.
1253 memory.swap.events
1254 A read-only flat-keyed file which exists on non-root cgroups.
1266 because of running out of swap system-wide or max
1272 reduces the impact on the workload and memory management.
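The flat-keyed and single value memory files described above can be
read with ordinary text tools. The shell sketch below looks up one
key in a flat-keyed file and compares "memory.current" against
"memory.high". The function names are hypothetical, and it assumes
that "max" in "memory.high" means no boundary is configured.

```shell
# Print the value stored under key $2 in the flat-keyed file $1
# (one "KEY VALUE" pair per line, as in memory.events or memory.stat).
flat_key() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Succeed if the cgroup in directory $1 is at or above its
# "memory.high" boundary. Both files hold byte counts.
over_memory_high() {
    high=$(cat "$1/memory.high")
    [ "$high" = "max" ] && return 1        # no boundary configured
    [ "$(cat "$1/memory.current")" -ge "$high" ]
}
```

For example, "flat_key $MOUNT_POINT/A/memory.events high" prints how
many times processes of A were throttled because the high boundary
was exceeded.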
1278 "memory.high" is the main mechanism to control memory usage.
1279 Over-committing on high limit (sum of high limits > available memory)
1280 and letting global memory pressure distribute memory according to
1286 more memory or terminating the workload.
1288 Determining whether a cgroup has enough memory is not trivial as
1289 memory usage doesn't indicate whether the workload can benefit from
1290 more memory. For example, a workload which writes data received from
1291 network to a file can use all available memory but can also perform
1292 just as well with a small amount of memory. A measure of memory
1293 pressure - how much the workload is being impacted due to lack of
1294 memory - is necessary to determine whether a workload needs more
1295 memory; unfortunately, a memory pressure monitoring mechanism isn't
1299 Memory Ownership
1302 A memory area is charged to the cgroup which instantiated it and stays
1304 to a different cgroup doesn't move the memory usages that it
1307 A memory area may be used by processes belonging to different cgroups.
1308 To which cgroup the area will be charged is indeterminate; however,
1309 over time, the memory area is likely to end up in a cgroup which has
1310 enough memory allowance to avoid high reclaim pressure.
1312 If a cgroup sweeps a considerable amount of memory which is expected
1314 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
1315 belonging to the affected files to ensure correct memory ownership.
1319 --
1321 The "io" controller regulates the distribution of IO resources. This
1322 controller implements both weight based and absolute bandwidth or IOPS
1324 only if cfq-iosched is in use and neither scheme is available for
1325 blk-mq devices.
1332 A read-only nested-keyed file which exists on non-root
1353 A read-write flat-keyed file which exists on non-root cgroups.
1373 A read-write nested-keyed file which exists on non-root
1387 When writing, any number of nested key-value pairs can be
1417 mechanism. Writeback sits between the memory and IO domains and
1418 regulates the proportion of dirty memory by balancing dirtying and
1421 The io controller, in conjunction with the memory controller,
1422 implements control of page cache writeback IOs. The memory controller
1423 defines the memory domain that the dirty memory ratio is calculated and
1424 maintained for and the io controller defines the io domain which
1425 writes out dirty pages for the memory domain. Both system-wide and
1426 per-cgroup dirty memory states are examined and the more restrictive
1434 There are inherent differences in memory and writeback management
1435 which affect how cgroup ownership is tracked. Memory is tracked per
1440 As cgroup ownership for memory is tracked per page, there can be pages
1452 As the memory controller assigns page ownership on the first use and
1463 amount of available memory capped by limits imposed by the
1464 memory controller and system-wide clean memory.
1468 total available memory and applied the same way as
1475 This is a cgroup v2 controller for IO workload protection. You provide a group
1477 controller will throttle any peers that have a lower latency target than the
1497 your real setting, setting at 10-15% higher than the value in io.stat.
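The 10-15% rule of thumb above is simple integer arithmetic. The
shell sketch below takes an observed latency, read manually from
io.stat, and prints the lower and upper ends of the suggested
target range. The function name is hypothetical, and the value is
assumed to be an integer in whatever unit io.stat reports.

```shell
# Print "LOW HIGH": targets 10% and 15% above the observed latency $1.
latency_target_range() {
    echo "$(( $1 * 110 / 100 )) $(( $1 * 115 / 100 ))"
}
```

For an observed value of 2000, this prints "2200 2300".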
1503 target the controller doesn't do anything. Once a group starts missing its
1507 - Queue depth throttling. This is the number of outstanding IO's a group is
1511 - Artificial delay induction. There are certain types of IO that cannot be
1534 If the controller is enabled you will see extra stats in io.stat in
1552 ---
1554 The process number controller is used to allow a cgroup to stop any
1559 controllers cannot prevent, thus warranting its own controller. For
1561 hitting memory restrictions.
1563 Note that PIDs used in this controller refer to TIDs, process IDs as
1571 A read-write single value file which exists on non-root
1577 A read-only single value file which exists on all cgroups.
1587 through fork() or clone(). These will return -EAGAIN if the creation
1591 Device controller
1592 -----------------
1594 Device controller manages access to device files. It includes both
1598 Cgroup v2 device controller has no interface files and is implemented
1603 the attempt will succeed or fail with -EPERM.
1608 If the program returns 0, the attempt fails with -EPERM, otherwise
1616 ----
1618 The "rdma" controller regulates the distribution and accounting of
1625 A read-write nested-keyed file that exists for all the cgroups
1646 A read-only file that describes current resource usage.
1656 ----
1661 perf_event controller, if not mounted on a legacy hierarchy, is
1663 always be filtered by cgroup v2 path. The controller can still be
1667 Non-normative information
1668 -------------------------
1674 CPU controller root cgroup process behaviour
1684 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
1687 IO controller root cgroup process behaviour
1700 ------
1719 The path '/batchjobs/container_id1' can be considered as system-data
1724 # ls -l /proc/self/ns/cgroup
1725 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
1731 # ls -l /proc/self/ns/cgroup
1732 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
1736 When some thread from a multi-threaded process unshares its cgroup
1748 ------------------
1759 # ~/unshare -c # unshare cgroupns in some cgroup
1767 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
1798 ----------------------
1827 ---------------------------------
1830 running inside a non-init cgroup namespace::
1832 # mount -t cgroup2 none $MOUNT_POINT
1839 the view of cgroup hierarchy by namespace-private cgroupfs mount
1852 --------------------------------
1855 address_space_operations->writepage[s]() to annotate bio's using the
1870 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
1887 - Multiple hierarchies including named ones are not supported.
1889 - None of the v1 mount options are supported.
1891 - The "tasks" file is removed and "cgroup.procs" is not sorted.
1893 - "cgroup.clone_children" is removed.
1895 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
1903 --------------------
1909 For example, as there is only one instance of each controller, utility
1916 the specific controller.
1920 each controller on its own hierarchy. Only closely related ones, such
1939 Also, as a controller couldn't have any expectation regarding the
1941 controller had to assume that all other controllers were attached to
1948 depending on the specific controller. In other words, hierarchy may
1951 how memory is distributed beyond a certain level while still wanting
1956 ------------------
1964 Generally, in-process knowledge is available only to the process
1965 itself; thus, unlike service-level organization of processes,
1972 sub-hierarchies and control resource distributions along them. This
1973 effectively raised cgroup to the status of a syscall-like API exposed
1983 that the process would actually be operating on its own sub-hierarchy.
1987 system-management pseudo filesystem. cgroup ended up with interface
1990 individual applications through the ill-defined delegation mechanism
2000 -------------------------------------------
2008 The cpu controller considered threads and cgroups as equivalents and
2011 cycles and the number of internal threads fluctuated - the ratios
2017 The io controller implicitly created a hidden leaf node for each
2025 The memory controller didn't have a way to control what happened
2027 clearly defined. There were attempts to add ad-hoc behaviors and
2041 ----------------------
2045 was how an empty cgroup was notified - a userland helper binary was
2048 to in-kernel event delivery filtering mechanism further complicating
2051 Controller interfaces were problematic too. An extreme example is
2063 formats and units even in the same controller.
2069 Controller Issues and Remedies
2070 ------------------------------
2072 Memory
2077 global reclaim prefers is opt-in, rather than opt-out. The costs for
2087 becomes self-defeating.
2089 The memory.low boundary on the other hand is a top-down allocated
2096 available memory. The memory consumption of workloads varies during
2104 The memory.high boundary on the other hand can be set much more
2110 and make corrections until the minimal memory footprint that still
2117 system than killing the group. Otherwise, memory.max is there to
2121 Setting the original memory.limit_in_bytes below the current usage was
2123 limit setting to fail. memory.max on the other hand will first set the
2125 new limit is met - or the task writing to memory.max is killed.
2127 The combined memory+swap accounting and limiting is replaced by real
2130 The main argument for a combined memory+swap facility in the original
2132 able to swap all anonymous memory of a child group, regardless of the
2134 groups can sabotage swapping by other means - such as referencing its
2135 anonymous memory in a tight loop - and an admin cannot assume full