Lines Matching +full:charge +full:- +full:current +full:- +full:limit +full:- +full:mapping
1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. HugeTLB
68 5-8-1. HugeTLB Interface Files
69 5-9. Misc
70 5-9-1. Miscellaneous cgroup Interface Files
71 5-9-2. Migration and Ownership
72 5-10. Others
73 5-10-1. perf_event
74 5-N. Non-normative information
75 5-N-1. CPU controller root cgroup process behaviour
76 5-N-2. IO controller root cgroup process behaviour
78 6-1. Basics
79 6-2. The Root and Views
80 6-3. Migration and setns(2)
81 6-4. Interaction with Other Namespaces
83 P-1. Filesystem Support for Writeback
86 R-1. Multiple Hierarchies
87 R-2. Thread Granularity
88 R-3. Competition Between Inner Nodes and Threads
89 R-4. Other Interface Issues
90 R-5. Controller Issues and Remedies
91 R-5-1. Memory
98 -----------
107 ---------------
113 cgroup is largely composed of two parts - the core and controllers.
129 hierarchical - if a controller is enabled on a cgroup, it affects all
131 sub-hierarchy of the cgroup. When a controller is enabled on a nested
141 --------
146 # mount -t cgroup2 none $MOUNT_POINT
156 is no longer referenced in its current hierarchy. Because per-cgroup
163 to inter-controller dependencies, other controllers may need to be
184 ignored on non-init namespace mounts. Please refer to the
196 Only populate memory.events with data for the current cgroup,
201 option is ignored on non-init namespace mounts.
209 behavior but is a mount-option to avoid regressing setups
215 --------------------------------
221 A child cgroup can be created by creating a sub-directory::
226 structure. Each cgroup has a read-writable interface file
228 belong to the cgroup one-per-line. The PIDs are not ordered and the
259 0::/test-cgroup/test-cgroup-nested
266 0::/test-cgroup/test-cgroup-nested (deleted)
292 constraint - threaded controllers can be enabled on non-leaf cgroups
302 The current operation mode or type of the cgroup is shown in the
316 - As the cgroup will join the parent's resource domain. The parent
319 - When the parent is an unthreaded domain, it must not have any domain
323 Topology-wise, a cgroup can be in an invalid state. Please consider
326 A (threaded domain) - B (threaded) - C (domain, just created)
341 threads in the cgroup. Except that the operations are per-thread
342 instead of per-process, "cgroup.threads" has the same format and
364 between threads in a non-leaf cgroup and its child cgroups. Each
369 --------------------------
371 Each non-root cgroup has a "cgroup.events" file which contains
372 "populated" field indicating whether the cgroup's sub-hierarchy has
376 example, to start a clean-up operation after all processes of a given
377 sub-hierarchy have exited. The populated state updates and
378 notifications are recursive. Consider the following sub-hierarchy
382 A(4) - B(0) - C(1)
392 -----------------------
406 # echo "+cpu +memory -io" > cgroup.subtree_control
415 Consider the following sub-hierarchy. The enabled controllers are
418 A(cpu,memory) - B(memory) - C()
432 controller interface files - anything which doesn't start with
436 Top-down Constraint
439 Resources are distributed top-down and a cgroup can further distribute
441 parent. This means that all non-root "cgroup.subtree_control" files
451 Non-root cgroups can distribute domain resources to their children
466 refer to the Non-normative information section in the Controllers
479 ----------
499 delegated, the user can build sub-hierarchy under the directory,
503 happens in the delegated sub-hierarchy, nothing can escape the
507 cgroups in or nesting depth of a delegated sub-hierarchy; however,
514 A delegated sub-hierarchy is contained in the sense that processes
515 can't be moved into or out of the sub-hierarchy by the delegatee.
518 requiring the following conditions for a process with a non-root euid
522 - The writer must have write access to the "cgroup.procs" file.
524 - The writer must have write access to the "cgroup.procs" file of the
528 processes around freely in the delegated sub-hierarchy it can't pull
529 in from or push out to outside the sub-hierarchy.
535 ~~~~~~~~~~~~~ - C0 - C00
538 ~~~~~~~~~~~~~ - C1 - C10
545 will be denied with -EACCES.
550 is not reachable, the migration is rejected with -ENOENT.
554 ----------
562 inherent trade-offs between migration and various hot paths in terms
568 resource structure once on start-up. Dynamic adjustments to resource
601 -------
607 work-conserving. Due to the dynamic nature, this model is usually
622 .. _cgroupv2-limits-distributor:
625 ------
628 Limits can be over-committed - the sum of the limits of children can
633 As limits can be over-committed, all configuration combinations are
640 .. _cgroupv2-protections-distributor:
643 -----------
648 soft boundaries. Protections can also be over-committed in which case
655 As protections can be over-committed, all configuration combinations
659 "memory.low" implements best-effort memory protection and is an
664 -----------
667 resource. Allocations can't be over-committed - the sum of the
674 As allocations can't be over-committed, some configuration
679 "cpu.rt.max" hard-allocates realtime slices and is an example of this
687 ------
692 New-line separated values
700 (when read-only or multiple values can be written at once)
726 -----------
728 - Settings for a single feature should be contained in a single file.
730 - The root cgroup should be exempt from resource control and thus
733 - The default time unit is microseconds. If a different unit is ever
736 - A parts-per quantity should use a percentage decimal with at least
737 two digit fractional part - e.g. 13.40.
739 - If a controller implements weight based resource distribution, its
745 - If a controller implements an absolute resource guarantee and/or
746 limit, the interface files should be named "min" and "max"
748 guarantee and/or limit, the interface files should be named "low"
754 - If a setting has a configurable default value and keyed specific
768 # cat cgroup-example-interface-file
774 # echo 125 > cgroup-example-interface-file
778 # echo "default 125" > cgroup-example-interface-file
782 # echo "8:16 170" > cgroup-example-interface-file
786 # echo "8:0 default" > cgroup-example-interface-file
787 # cat cgroup-example-interface-file
791 - For events which are not very high frequency, an interface file
798 --------------------
803 A read-write single value file which exists on non-root
806 When read, it indicates the current type of the cgroup, which
809 - "domain" : A normal valid domain cgroup.
811 - "domain threaded" : A threaded domain cgroup which is
814 - "domain invalid" : A cgroup which is in an invalid state.
818 - "threaded" : A threaded cgroup which is a member of a
825 A read-write new-line separated values file which exists on
829 the cgroup one-per-line. The PIDs are not ordered and the
838 - It must have write access to the "cgroup.procs" file.
840 - It must have write access to the "cgroup.procs" file of the
843 When delegating a sub-hierarchy, write access to this file
851 A read-write new-line separated values file which exists on
855 the cgroup one-per-line. The TIDs are not ordered and the
864 - It must have write access to the "cgroup.threads" file.
866 - The cgroup that the thread is currently in must be in the
869 - It must have write access to the "cgroup.procs" file of the
872 When delegating a sub-hierarchy, write access to this file
876 A read-only space separated values file which exists on all
883 A read-write space separated values file which exists on all
890 Space separated list of controllers prefixed with '+' or '-'
892 name prefixed with '+' enables the controller and '-'
898 A read-only flat-keyed file which exists on non-root cgroups.
910 A read-write single value file. The default is "max".
917 A read-write single value file. The default is "max".
919 Maximum allowed descent depth below the current cgroup.
924 A read-only flat-keyed file with the following entries:
942 A read-write single value file which exists on non-root cgroups.
965 create new sub-cgroups.
968 A write-only single value file which exists in non-root cgroups.
980 the whole thread-group.
983 A read-write single value file whose allowed values are "0" and "1".
987 Writing "1" to the file will re-enable the cgroup PSI accounting.
995 This may cause non-negligible overhead for some workloads when under
997 be used to disable PSI accounting in the non-leaf cgroups.
1000 A read-write nested-keyed file.
1008 .. _cgroup-v2-cpu:
1011 ---
1014 controller implements weight and absolute bandwidth limit models for
1039 A read-only flat-keyed file.
1044 - usage_usec
1045 - user_usec
1046 - system_usec
1050 - nr_periods
1051 - nr_throttled
1052 - throttled_usec
1053 - nr_bursts
1054 - burst_usec
1057 A read-write single value file which exists on non-root
1063 A read-write single value file which exists on non-root
1066 The nice value is in the range [-20, 19].
1072 the closest approximation of the current weight.
1075 A read-write two value file which exists on non-root cgroups.
1078 The maximum bandwidth limit. It's in the following format::
1083 $PERIOD duration. "max" for $MAX indicates no limit. If only
1087 A read-write single value file which exists on non-root
1093 A read-write nested-keyed file.
1099 A read-write single value file which exists on non-root cgroups.
1110 the current value for the maximum utilization (limit), i.e.
1114 A read-write single value file which exists on non-root cgroups.
1117 The requested maximum utilization (limit) as a percentage rational
1127 ------
1130 stateful and implements both limit and protection models. Due to the
1135 While not completely water-tight, all major memory usages by a given
1140 - Userland memory - page cache and anonymous memory.
1142 - Kernel data structures such as dentries and inodes.
1144 - TCP socket buffers.
1156 memory.current
1157 A read-only single value file which exists on non-root
1164 A read-write single value file which exists on non-root
1190 A read-write single value file which exists on non-root
1193 Best-effort memory protection. If the memory usage of a
1213 A read-write single value file which exists on non-root
1216 Memory usage throttle limit. If a cgroup's usage goes
1220 Going over the high limit never invokes the OOM killer and
1221 under extreme conditions the limit may be breached. The high
1222 limit should be used in scenarios where an external process
1227 A read-write single value file which exists on non-root
1230 Memory usage hard limit. This is the main mechanism to limit
1232 this limit and can't be reduced, the OOM killer is invoked in
1234 over the limit temporarily.
1236 In default configuration regular 0-order allocations always
1237 succeed unless OOM killer chooses current task as a victim.
1241 as -ENOMEM or silently ignore in cases like disk readahead.
1244 A write-only nested-keyed file which exists for all cgroups.
1262 specified amount, -EAGAIN is returned.
1272 A read-only single value file which exists on non-root
1279 A read-write single value file which exists on non-root
1289 Tasks with the OOM protection (oom_score_adj set to -1000)
1297 A read-only flat-keyed file which exists on non-root cgroups.
1311 boundary is over-committed.
1317 cgroup whose memory usage is capped by the high limit
1328 reached the limit and allocation was about to fail.
1331 considered as an option, e.g. for failed high-order
1347 A read-only flat-keyed file which exists on non-root cgroups.
1350 types of memory, type-specific details, and other information
1359 If the entry has no per-node counter (or is not shown in
1360 memory.numa_stat). We use 'npn' (non-per-node) as the tag
1388 Amount of memory used for storing per-cpu kernel
1398 Amount of cached filesystem data that is swap-backed,
1435 Amount of memory, swap-backed and filesystem-backed,
1441 the value for the foo counter, since the foo counter is type-based, not
1442 list-based.
1453 Amount of memory used for storing in-kernel data
1536 A read-only nested-keyed file which exists on non-root cgroups.
1539 types of memory, type-specific details, and other information
1560 memory.swap.current
1561 A read-only single value file which exists on non-root
1568 A read-write single value file which exists on non-root
1571 Swap usage throttle limit. If a cgroup's swap usage exceeds
1572 this limit, all its further allocations will be throttled to
1573 allow userspace to implement custom out-of-memory procedures.
1575 This limit marks a point of no return for the cgroup. It is NOT
1581 Healthy workloads are not expected to reach this limit.
1584 A read-only single value file which exists on non-root
1591 A read-write single value file which exists on non-root
1594 Swap usage hard limit. If a cgroup's swap usage reaches this
1595 limit, anonymous memory of the cgroup will not be swapped out.
1598 A read-only flat-keyed file which exists on non-root cgroups.
1614 because of running out of swap system-wide or max
1615 limit.
1617 When reduced under the current usage, the existing swap
1619 higher than the limit for an extended period of time. This
1622 memory.zswap.current
1623 A read-only single value file which exists on non-root
1630 A read-write single value file which exists on non-root
1633 Zswap usage hard limit. If a cgroup's zswap pool reaches this
1634 limit, it will refuse to take any more stores before existing
1638 A read-only nested-keyed file.
1648 Over-committing on high limit (sum of high limits > available memory)
1652 Because breach of the high limit doesn't trigger the OOM killer but
1662 pressure - how much the workload is being impacted due to lack of
1663 memory - is necessary to determine whether a workload needs more
1677 Which cgroup the area will be charged to is nondeterministic; however,
1688 --
1692 limit distribution; however, weight based distribution is available
1693 only if cfq-iosched is in use and neither scheme is available for
1694 blk-mq devices.
1701 A read-only nested-keyed file.
1721 A read-write nested-keyed file which exists only on the root
1733 enable Weight-based control enable
1765 devices which show wide temporary behavior changes - e.g. a
1776 A read-write nested-keyed file which exists only on the root
1789 model The cost model in use - "linear"
1815 generate device-specific coefficients.
1818 A read-write flat-keyed file which exists on non-root cgroups.
1838 A read-write nested-keyed file which exists on non-root
1841 BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
1852 When writing, any number of nested key-value pairs can be
1854 to remove a specific limit. If the same key is specified
1858 delayed if limit is reached. Temporary bursts are allowed.
1860 Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
1868 Write IOPS limit can be removed by writing the following::
1877 A read-only nested-keyed file.
1896 writes out dirty pages for the memory domain. Both system-wide and
1897 per-cgroup dirty memory states are examined and the more restrictive
1935 memory controller and system-wide clean memory.
1968 your real setting, setting at 10-15% higher than the value in io.stat.
1978 - Queue depth throttling. This is the number of outstanding IO's a group is
1979 allowed to have. We will clamp down relatively quickly, starting at no limit
1982 - Artificial delay induction. There are certain types of IO that cannot be
1990 limit the individual delay events to 1 second at a time.
2009 This is the current queue depth for the group.
2029 no-change
2032 promote-to-rt
2033 For requests that have a non-RT I/O priority class, change it into RT.
2037 restrict-to-be
2047 none-to-rt
2048 Deprecated. Just an alias for promote-to-rt.
2052 +----------------+---+
2053 | no-change | 0 |
2054 +----------------+---+
2055 | rt-to-be | 2 |
2056 +----------------+---+
2057 | all-to-idle | 3 |
2058 +----------------+---+
2062 +-------------------------------+---+
2064 +-------------------------------+---+
2065 | IOPRIO_CLASS_RT (real-time) | 1 |
2066 +-------------------------------+---+
2068 +-------------------------------+---+
2070 +-------------------------------+---+
2074 - If I/O priority class policy is promote-to-rt, change the request I/O
2077 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2083 ---
2086 new tasks from being fork()'d or clone()'d after a specified limit is
2102 A read-write single value file which exists on non-root
2105 Hard limit of number of processes.
2107 pids.current
2108 A read-only single value file which exists on all cgroups.
2114 possible to have pids.current > pids.max. This can be done by either
2115 setting the limit to be smaller than pids.current, or attaching enough
2116 processes to the cgroup such that pids.current is larger than
2118 through fork() or clone(). These will return -EAGAIN if the creation
2123 ------
2127 specified in the cpuset interface files in a task's current cgroup.
2130 memory placement to reduce cross-node memory access and contention
2141 A read-write multiple values file which exists on non-root
2142 cpuset-enabled cgroups.
2149 The CPU numbers are comma-separated numbers or ranges.
2153 0-4,6,8-10
2156 setting as the nearest cgroup ancestor with a non-empty
2163 A read-only multiple values file which exists on all
2164 cpuset-enabled cgroups.
2168 tasks within the current cgroup.
2180 A read-write multiple values file which exists on non-root
2181 cpuset-enabled cgroups.
2188 The memory node numbers are comma-separated numbers or ranges.
2192 0-1,3
2195 setting as the nearest cgroup ancestor with a non-empty
2202 Setting a non-empty value to "cpuset.mems" causes memory of
2214 A read-only multiple values file which exists on all
2215 cpuset-enabled cgroups.
2219 be used by tasks within the current cgroup.
2230 A read-write single value file which exists on non-root
2231 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2237 "member" Non-root member of a partition
2243 cannot be changed. All other non-root cgroups start out as
2246 When set to "root", the current cgroup is the root of a new
2263 two possible states - valid or invalid. An invalid partition
2274 "member" Non-root member of a partition
2306 A valid non-root parent partition may distribute out all its CPUs
2326 -----------------
2337 on the return value the attempt will succeed or fail with -EPERM.
2342 If the program returns 0, the attempt fails with -EPERM, otherwise it
2350 ----
2359 A read-write nested-keyed file that exists for all cgroups
2360 except root and describes the currently configured resource limit
2365 limit that can be distributed.
2379 rdma.current
2380 A read-only file that describes current resource usage.
2389 -------
2391 The HugeTLB controller allows limiting HugeTLB usage per control group and
2392 enforces the controller limit during page faults.
2397 hugetlb.<hugepagesize>.current
2398 Show current usage for "hugepagesize" hugetlb. It exists for all
2402 Set/show the hard limit of "hugepagesize" hugetlb usage.
2406 A read-only flat-keyed file which exists on non-root cgroups.
2409 The number of allocation failures due to the HugeTLB limit
2419 use hugetlb pages are included. The per-node values are in bytes.
2422 ----
2434 Once a capacity is set then the resource usage can be updated using charge and
2444 A read-only flat-keyed file shown only in the root cgroup. It shows
2452 misc.current
2453 A read-only flat-keyed file shown in all cgroups. It shows
2454 the current usage of the resources in the cgroup and its children::
2456 $ cat misc.current
2461 A read-write flat-keyed file shown in non-root cgroups. Allowed
2468 Limit can be set by::
2472 Limit can be set to max by::
2480 A read-only flat-keyed file which exists on non-root cgroups. The
2494 a process to a different cgroup does not move the charge to the destination
2498 ------
2509 Non-normative information
2510 -------------------------
2524 For details of this mapping see sched_prio_to_weight array in
2526 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2542 ------
2561 The path '/batchjobs/container_id1' can be considered as system-data
2566 # ls -l /proc/self/ns/cgroup
2567 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2573 # ls -l /proc/self/ns/cgroup
2574 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2578 When some thread from a multi-threaded process unshares its cgroup
2590 ------------------
2601 # ~/unshare -c # unshare cgroupns in some cgroup
2609 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2640 ----------------------
2659 (a) the process has CAP_SYS_ADMIN against its current user namespace
2669 ---------------------------------
2672 running inside a non-init cgroup namespace::
2674 # mount -t cgroup2 none $MOUNT_POINT
2681 the view of cgroup hierarchy by namespace-private cgroupfs mount
2694 --------------------------------
2697 address_space_operations->writepage[s]() to annotate bio's using the
2714 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
2731 - Multiple hierarchies including named ones are not supported.
2733 - None of the v1 mount options are supported.
2735 - The "tasks" file is removed and "cgroup.procs" is not sorted.
2737 - "cgroup.clone_children" is removed.
2739 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
2747 --------------------
2775 There was no limit on how many hierarchies there might be, which meant
2800 ------------------
2808 Generally, in-process knowledge is available only to the process
2809 itself; thus, unlike service-level organization of processes,
2816 sub-hierarchies and control resource distributions along them. This
2817 effectively raised cgroup to the status of a syscall-like API exposed
2827 that the process would actually be operating on its own sub-hierarchy.
2831 system-management pseudo filesystem. cgroup ended up with interface
2834 individual applications through the ill-defined delegation mechanism
2844 -------------------------------------------
2855 cycles and the number of internal threads fluctuated - the ratios
2857 There also were other issues. The mapping from nice level to weight
2871 clearly defined. There were attempts to add ad-hoc behaviors and
2885 ----------------------
2889 was how an empty cgroup was notified - a userland helper binary was
2892 to in-kernel event delivery filtering mechanism further complicating
2914 ------------------------------
2919 The original lower boundary, the soft limit, is defined as a limit
2921 global reclaim prefers is opt-in, rather than opt-out. The costs for
2924 basic desirable behavior. First off, the soft limit has no
2928 the soft limit reclaim pass is so aggressive that it not just
2931 becomes self-defeating.
2933 The memory.low boundary on the other hand is a top-down allocated
2939 The original high boundary, the hard limit, is defined as a strict
2940 limit that can not budge, even if the OOM killer has to be called.
2944 strict upper limit requires either a fairly accurate prediction of the
2945 working set size or adding slack to the limit. Since working set size
2947 OOM kills, most users tend to err on the side of a looser limit and
2964 limit this type of spillover and ultimately contain buggy or even
2967 Setting the original memory.limit_in_bytes below the current usage was
2969 limit setting to fail. memory.max on the other hand will first set the
2970 limit to prevent new charges, and then reclaim and OOM kill until the
2971 new limit is met - or the task writing to memory.max is killed.
2980 groups can sabotage swapping by other means - such as referencing its
2981 anonymous memory in a tight loop - and an admin can not assume full
2986 that cgroup controllers should account and limit specific physical