cgroup-v2.rst - OpenGrok cross reference for /kernel/linux/linux-5.10/Documentation/admin-guide/cgroup-v2.rst

Lines Matching full:and
8 This is the authoritative documentation on the design, interface and
10 of cgroup including core and specific controller behaviors.  All
21      2-2. Organizing Processes and Threads
26        2-4-1. Enabling and Disabling
33        2-6-1. Organize Once and Control
73      6-2. The Root and Views
74      6-3. Migration and setns(2)
79    R. Issues with v1 and Rationales for v2
82      R-3. Competition Between Inner Nodes and Threads
84      R-5. Controller Issues and Remedies
94 "cgroup" stands for "control group" and is never capitalized.  The
95 singular form is used to designate the whole feature and also as a
103 cgroup is a mechanism to organize processes hierarchically and
104 distribute system resources along the hierarchy in a controlled and
107 cgroup is largely composed of two parts - the core and controllers.
114 cgroups form a tree structure and every process in the system belongs
115 to one and only one cgroup.  All threads of a process belong to the
143 controllers which support v2 and are not bound to a v1 hierarchy are
144 automatically bound to the v2 hierarchy and show up at the root.
151 controller states are destroyed asynchronously and controllers may
155 the unified hierarchy and it may take some time for the disabled
160 While useful for development and manual configurations, moving
161 controllers dynamically between the v2 and other hierarchies is
163 the hierarchies and controller associations before starting using the
167 automount the v1 cgroup filesystem and so hijack all controllers
169 and experimenting easier, the kernel parameter cgroup_no_v1= allows
170 disabling controllers in v1 and making them always available in v2.
177 	option is system wide and can only be set on mount or modified
185         and not any subtrees. This is legacy behaviour, the default
187         This option is system wide and can only be set on mount or
193         Recursively apply memory.min and memory.low protection to
203 Organizing Processes and Threads
217 belong to the cgroup one-per-line.  The PIDs are not ordered and the
219 another cgroup and then back or the PID got recycled while reading.
231 zombie process does not appear in "cgroup.procs" and thus can't be
236 have any children and is associated only with zombie processes is
237 considered empty and can be removed::
250 If the process becomes a zombie and the cgroup it was associated with
276 threaded, is called threaded domain or thread root interchangeably and
280 different cgroups and are not subject to the no internal process
286 resource consumptions whether there are processes in it or not and
289 serve both as a threaded domain and a parent to domain cgroups.
296 On creation, a cgroup is always a domain cgroup and can be made
331 instead of per-process, "cgroup.threads" has the same format and
338 subtree, and, while the threads can be scattered across the subtree,
341 processes in the subtree and is not readable in the subtree proper.
347 accounts for and controls resource consumptions associated with the
348 threads in the cgroup and its descendants.  All consumptions which
353 between threads in a non-leaf cgroup and its child cgroups.  Each
363 the cgroup and its descendants; otherwise, 1.  poll and [id]notify
366 sub-hierarchy have exited.  The populated state updates and
374 A, B and C's "populated" fields would be 1 while D's 0.  After the one
375 process in C exits, B and C's "populated" fields would flip to "0" and
383 Enabling and Disabling
392 No controller is enabled by default.  Controllers can be enabled and
410 As A has "cpu" and "memory" enabled, A will control the distribution
411 of CPU cycles and memory to its children, in this case, B.  As B has
412 "memory" enabled but not "CPU", C and D will compete freely on CPU
418 would create the "cpu." prefixed controller interface files in C and
420 prefixed controller interface files from C and D.  This means that the
428 Resources are distributed top-down and a cgroup can further distribute
433 the parent has the controller enabled and a controller can't be
451 processes and anonymous resource consumption which can't be associated
452 with any other cgroups and requires special treatment from most
462 cgroup must create children and transfer all its processes to the
474 user by granting write access of the directory and its "cgroup.procs",
475 "cgroup.threads" and "cgroup.subtree_control" files to the user.
483 kernel rejects writes to all files other than "cgroup.procs" and
489 organize processes inside it as it sees fit and further distribute the
490 resources it received from the parent.  The limits and other settings
491 of all resource controllers are hierarchical and regardless of what
514   common ancestor of the source and destination cgroups.
520 For example, let's assume cgroups C0 and C1 have been delegated to
521 user U0 who created C00, C01 under C0 and C10 under C1 as follows and
522 all processes under C0 and C1 belong to U0::
531 file; however, the common ancestor of the source cgroup C10 and the
532 destination cgroup C00 is above the points of delegation and U0 would
533 not have write access to its "cgroup.procs" files and thus the write
537 that both the source and destination cgroups are reachable from the
545 Organize Once and Control
549 and stateful resources such as memory are not moved together with the
551 inherent trade-offs between migration and various hot paths in terms
556 should be assigned to a cgroup according to the system's logical and
565 Interface files for a cgroup and its children cgroups occupy the same
566 directory and it is possible to create children cgroups which collide
569 All cgroup core interface files are prefixed with "cgroup." and each
570 controller's interface files are prefixed with the controller name and
571 a dot.  A controller's name is composed of lower case letters and
577 cgroup doesn't do anything to prevent name collisions and it's the
585 depending on the resource type and expected use cases.  This section
593 active children and giving each the fraction matching the ratio of its
604 valid and there is no reason to reject configuration changes or
608 and is an example of this type.
618 Limits are in the range [0, max] and default to "max", which is noop.
621 valid and there is no reason to reject configuration changes or
624 "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
625 on an IO device and is an example of this type.
638 Protections are in the range [0, max] and default to 0, which is
642 are valid and there is no reason to reject configuration changes or
645 "memory.low" implements best-effort memory protection and is an
657 Allocations are in the range [0, max] and default to 0, which is no
661 combinations are invalid and should be rejected.  Also, if the
665 "cpu.rt.max" hard-allocates realtime slices and is an example of this
706 For both flat and nested keyed files, only the values for a single key
708 may be specified in any order and not all pairs have to be specified.
716 - The root cgroup should be exempt from resource control and thus
726   interface file should be named "weight" and have the range [1,
728   enough and symmetric bias in both directions while keeping it
731 - If a controller implements an absolute resource guarantee and/or
732   limit, the interface files should be named "min" and "max"
734   guarantee and/or limit, the interface files should be named "low"
735   and "high" respectively.
738   used to represent upward infinity for both reading and writing.
740 - If a setting has a configurable default value and keyed specific
741   overrides, the default entry should be keyed with "default" and
770   and cleared by::
816 	the cgroup one-per-line.  The PIDs are not ordered and the
818 	to another cgroup and then back or the PID got recycled while
828 	  common ancestor of the source and destination cgroups.
835 	supported and moves every thread of the process to the cgroup.
842 	the cgroup one-per-line.  The TIDs are not ordered and the
844 	another cgroup and then back or the TID got recycled while
857 	  common ancestor of the source and destination cgroups.
879 	name prefixed with '+' enables the controller and '-'
881 	the last one is effective.  When multiple enable and disable
930 	Allowed values are "0" and "1". The default is "0".
932 	Writing "1" to the file causes freezing of the cgroup and all
934 	be stopped and will not run until the cgroup will be explicitly
937 	will be updated to "1" and the corresponding notification will be
945 	They also can enter and leave a frozen cgroup: either by an explicit
951 	it's possible to delete a frozen (and empty) cgroup, as well as
961 controller implements weight and absolute bandwidth limit models for
962 normal scheduling policy and absolute bandwidth allocation model for
966 base and it does not account for the frequency at which tasks are executed.
972 WARNING: cgroup2 doesn't yet support control of realtime processes and
976 process, and these processes may need to be moved to the root cgroup
995 	and the following three when the controller is enabled:
1014 	"cpu.weight" and allows reading and setting weight using the
1015 	same values used by nice(2).  Because the range is smaller and
1044         This interface allows reading and setting minimum utilization clamp
1059         This interface allows reading and setting maximum utilization clamp
1069 stateful and implements both limit and protection models.  Due to the
1070 intertwining between memory usage and reclaim pressure and the
1076 accounted and controlled to a reasonable extent.  Currently, the
1079 - Userland memory - page cache and anonymous memory.
1081 - Kernel data structures such as dentries and inodes.
1100 	and its descendants.
1123 	protection is discouraged and may lead to constant OOMs.
1158 	throttled and put under heavy reclaim pressure.
1160 	Going over the high limit never invokes the OOM killer and
1168 	mechanism.  If a cgroup's memory usage reaches this limit and
1181 	high limit is used and monitored properly, this limit's
1196 	are treated as an exception and are never killed.
1208 	Note that all fields in this file are hierarchical and the
1221 		throttled and routed to perform direct memory reclaim
1234 		reached the limit and allocation was about to fail.
1253 	types of memory, type-specific details, and other information
1254 	on the state and past events of the memory management system.
1258 	The entries are ordered to be human readable, and new entries
1268 		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
1272 		including tmpfs and shared memory.
1296 		Amount of cached filesystem data that was modified and
1304 		Amount of memory, swap-backed and filesystem-backed,
1315 		dentries and inodes.
1390 	types of memory, type-specific details, and other information
1405 	The entries are ordered to be human readable, and new entries
1416 	and its descendants.
1453 		to go over the max boundary and swap allocation
1462 	entries are reclaimed gradually and the swap usage may stay
1464 	reduces the impact on the workload and memory management.
1478 and letting global memory pressure distribute memory according to
1483 opportunities to monitor and take appropriate actions such as granting
1500 A memory area is charged to the cgroup which instantiated it and stays
1520 controller implements both weight based and absolute bandwidth or IOPS
1522 only if cfq-iosched is in use and neither scheme is available for
1532 	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
1556 	are keyed by $MAJ:$MIN device numbers and not ordered.  The
1572 	The controller is disabled by default and can be enabled by
1573 	setting "enable" to 1.  "rpct" and "wpct" parameters default
1574 	to zero and the controller uses internal device saturation
1575 	state to adjust the overall IO rate between "min" and "max".
1584 	latencies is above 75ms or write 150ms, and adjust the overall
1585 	IO issue rate between 50% and 150% accordingly.
1589 	adjustment range between "min" and "max", the more conformant
1591 	base rate may be far off from 100% and setting "min" and "max"
1593 	control quality.  "min" and "max" are useful for regulating
1595 	ssd which accepts writes at the line speed for a while and
1599 	kernel and may change automatically.  Setting "ctrl" to "user"
1600 	or setting any of the percentile and latency parameters puts
1601 	it into "user" mode and disables the automatic changes.  The
1611 	by $MAJ:$MIN device numbers and not ordered.  The line for a
1623 	parameters are written to, "ctrl" becomes "user" and the
1636 	costs of a sequential and random IO and the cost coefficient
1641 	sense and is scaled to the device behavior dynamically.
1652 	$MAJ:$MIN device numbers and not ordered.  The weights are in
1653 	the range [1, 10000] and specify the relative amount of IO time
1658 	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
1670 	BPS and IOPS based IO limit.  Lines are keyed by $MAJ:$MIN
1671 	device numbers and not ordered.  The following nested keys are
1686 	BPS and IOPS are measured in each IO direction and IOs are
1689 	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
1715 Page cache is dirtied through buffered writes and shared mmaps and
1717 mechanism.  Writeback sits between the memory and IO domains and
1718 regulates the proportion of dirty memory by balancing dirtying and
1723 defines the memory domain that dirty memory ratio is calculated and
1724 maintained for and the io controller defines the io domain which
1725 writes out dirty pages for the memory domain.  Both system-wide and
1726 per-cgroup dirty memory states are examined and the more restrictive
1731 btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are
1734 There are inherent differences in memory and writeback management
1737 inode is assigned to a cgroup and all IO requests to write dirty pages
1743 constantly keeps track of foreign pages and, if a particular foreign
1752 As memory controller assigns page ownership on the first use and
1764 	memory controller and system-wide clean memory.
1768 	total available memory and applied the same way as
1776 with a latency target, and if the average latency exceeds that target the
1781 in the diagram below, only groups A, B, and C will influence each other, and
1782 groups D and F will influence each other.  Group G will influence nobody::
1791 So the ideal way to configure this is to set io.latency in groups A, B, and C.
1794 Start at higher than the expected latency for your device and watch the
1809   and going all the way down to 1 IO at a time.
1813   includes swapping and metadata IO.  These types of IO are allowed to occur
1815   originating group is being throttled you will see the use_delay and delay
1879 	The number of processes currently in the cgroup and its
1895 the CPU and memory node placement of tasks to only the resources
1898 on properly sized subsets of the systems with careful processor and
1899 memory placement to reduce cross-node memory access and contention
1915 	subjected to constraints imposed by its parent and can differ
1929 	and won't be affected by any CPU hotplug events.
1954 	is subjected to constraints imposed by its parent and can differ
1969 	and won't be affected by any memory nodes hotplug events.
1990 	and is not delegatable.
1999 	itself and all its descendants except those that are separate
2000 	partition roots themselves and their descendants.  The root
2007 	1) The "cpuset.cpus" is not empty and the list of CPUs is
2028 	partition and the new "cpuset.cpus" value is a superset of its
2041 	above are true and at least one CPU from "cpuset.cpus" is
2065 creation of new device files (using mknod), and access to the
2068 Cgroup v2 device controller has no interface files and is implemented
2070 create bpf programs of the BPF_CGROUP_DEVICE type and attach them
2072 BPF programs will be executed, and depending on the return value
2077 (mknod/read/write) and device (type, major and minor numbers).
2088 The "rdma" controller regulates the distribution and accounting of
2099 	Lines are keyed by device name and are not ordered.
2100 	Each line contains space separated resource name and its configured
2110 	An example for mlx4 and ocrdma device follows::
2119 	An example for mlx4 and ocrdma device follows::
2127 The HugeTLB controller allows limiting the HugeTLB usage per control group and
2168 the stable kernel API and so is subject to change.
2200 "/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
2201 flag can be used with clone(2) and unshare(2) to create a new cgroup
2209 a set of cgroups and namespaces are intended to isolate processes the
2217 and undesirable to expose to the isolated processes.  cgroup namespace
2240 namespace is destroyed.  The cgroupns root and the actual cgroups
2244 The Root and Views
2294 Migration and setns(2)
2297 Processes inside a cgroup namespace can move into and out of the
2300 /batchjobs/container_id1, and assuming that the global hierarchy is
2332 filesystem root.  The process needs CAP_SYS_ADMIN against its user and
2344 where interacting with cgroup is necessary.  cgroup core and
2356 	Should be called for each bio carrying writeback data and
2357 	associates the bio with the inode's owner cgroup and the
2359 	a queue (device) has been associated with the bio and
2365 	during the writeback session, it's the easiest and most
2375 the configuration, the bio may be executed at a lower priority and if
2379 cases by skipping wbc_init_bio() and using bio_associate_blkg()
2390 - The "tasks" file is removed and "cgroup.procs" is not sorted.
2398 Issues with v1 and Rationales for v2
2404 cgroup v1 allowed an arbitrary number of hierarchies and each
2418 put on the same hierarchy and most configurations resorted to putting
2420 as the cpu and cpuacct controllers, made sense to be put on the same
2428 used in general and what controllers were able to do.
2432 length.  The key might contain any number of entries and was unlimited
2433 in length, which made it highly awkward to manipulate and led to
2458 This didn't make sense for some controllers and those controllers
2461 individual applications and system management interface.
2470 individual applications so that they can create and manage their own
2471 sub-hierarchies and control resource distributions along them.  This
2479 and then read and/or write to it.  This is not only extremely clunky
2480 and unusual but also inherently racy.  There is no conventional way to
2481 define transaction across the required steps and nothing can guarantee
2487 knobs which were not properly abstracted or refined and directly
2493 This was painful for both userland and kernel.  Userland ended up with
2494 misbehaving and poorly abstracted interfaces and kernel exposing and
2498 Competition Between Inner Nodes and Threads
2502 interesting problem where threads belonging to a parent cgroup and its
2504 different types of entities competed and there was no obvious way to
2507 The cpu controller considered threads and cgroups as equivalents and
2510 cycles and the number of internal threads fluctuated - the ratios
2513 wasn't obvious or universal, and there were various other knobs which
2521 otherwise, made the interface messy and significantly complicated the
2525 between internal tasks and child cgroups and the behavior was not
2526 clearly defined.  There were attempts to add ad-hoc behaviors and
2530 Multiple controllers struggled with internal tasks and came up with
2532 severely flawed and, furthermore, the widely different behaviors
2542 cgroup v1 grew without oversight and developed a large number of
2543 idiosyncrasies and inconsistencies.  One issue on the cgroup core side
2545 forked and executed for each event.  The event delivery wasn't
2551 controllers completely ignoring hierarchical organization and treating
2560 control used widely differing naming schemes and formats.  Statistics
2561 and information knobs were named arbitrarily and used different
2562 formats and units even in the same controller.
2564 cgroup v2 establishes common conventions where appropriate and updates
2565 controllers so that they expose minimal and consistent interfaces.
2568 Controller Issues and Remedies
2581 rbtree and treated like equal peers, regardless where they are located
2598 runtime, and that requires users to overcommit.  But doing that with a
2601 estimation is hard and error prone, and getting it wrong results in
2602 OOM kills, most users tend to err on the side of a looser limit and
2611 and make corrections until the minimal memory footprint that still
2614 In extreme cases, with many concurrent allocations and a complete
2619 limit this type of spillover and ultimately contain buggy or even
2625 limit to prevent new charges, and then reclaim and OOM kill until the
2628 The combined memory+swap accounting and limiting is replaced by real
2636 anonymous memory in a tight loop - and an admin cannot assume full
2640 intuitive userspace interface, and it flies in the face of the idea
2641 that cgroup controllers should account and limit specific physical
2643 and that's why unified hierarchy allows distributing it separately.