
Memory Resource Controller
NOTE: The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse the memory controller
      used here with the memory controller that is used in hardware.
When we mention a cgroup (cgroupfs's directory) with the memory controller,
we call it a "memory cgroup". In git logs and source code you will see the
abbreviation "memcg"; this document avoids using it.
Benefits and Purpose of the memory controller
The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to
a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).
Current Status: linux-2.6.34-mmotm (development version of April 2010)

Features:
 - accounting of anonymous pages, file caches and swap caches, and limiting their usage.
 - pages are linked to a per-memcg LRU exclusively; there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) an account when moving a task is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.
The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted,
there were several competing implementations for memory control. The goal of
the RFC was to build consensus and agreement on the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2],
and Pavel Emelianov later posted further versions of the RSS controller [4][5].
At OLS, at the resource management BoF, everyone suggested that we handle both
page cache and RSS together. Another request was raised to allow user space
handling of OOM. The current memory controller, at version 6, combines both
mapped (RSS) and unmapped Page Cache Control [11].
The memory controller implementation has been divided into phases. These are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.
The core of the design is a counter that tracks the current memory usage and
limit of the group of processes associated with the controller. Each cgroup has
a memory-controller-specific data structure (mem_cgroup) associated with it.
[Figure 1: accounting hierarchy -- each task's mm_struct points to its
 mem_cgroup, and each page is linked to a page_cgroup, which knows the
 cgroup the page is charged to.]
Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup.
2. Each mm_struct knows about which cgroup it belongs to.
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to.
If everything goes well, a page meta-data structure called page_cgroup is
updated. page_cgroup has its own LRU on the cgroup.
(*) The page_cgroup structure is allocated at boot/memory-hotplug time.
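For orientation, the per-page metadata looks roughly like the sketch below;
the exact fields vary between kernel versions (see include/linux/page_cgroup.h
in older trees, and note that later kernels shrank this structure and
eventually folded the memcg pointer into struct page), so treat this as an
approximation rather than the current definition:

  /*
   * Rough sketch of the per-page metadata from the early cgroup-v1 era.
   * Field set is approximate and version dependent.
   */
  struct page_cgroup {
      unsigned long flags;              /* accounting state bits */
      struct mem_cgroup *mem_cgroup;    /* memory cgroup the page is charged to */
      struct page *page;                /* the struct page being described */
      struct list_head lru;             /* entry on the per-memcg LRU */
  };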
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
RSS pages are accounted at page fault unless they have already been accounted
for earlier. A file page is accounted as Page Cache when it's inserted into
the inode (radix-tree). While it's mapped into the page tables of processes,
duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A Page Cache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.
Note: The kernel does swapin-readahead and reads multiple swap slots at once.
This means swapped-in pages may belong to tasks other than the task causing
the page fault. So we avoid accounting at swap-in I/O.
Note: we only account pages on the LRU, because our purpose is to control the
amount of used pages; pages not on the LRU tend to be out of control from the
VM's point of view.
Shared pages are accounted on a first-touch basis: the cgroup that first
touches a page is charged for it. A cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
Note: When you do swapoff and force swapped-out pages of shmem (tmpfs) back
into memory, charges for those pages are accounted against the caller of
swapoff rather than the users of shmem.
Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to the original page allocator if possible.

When swap is accounted, the following files are added:
 - memory.memsw.usage_in_bytes
 - memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.
Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under a 2G memory limit will use up all of the swap.
In this case, setting memsw.limit_in_bytes=3G prevents this bad use of swap.
By using the memsw limit, you can avoid a system OOM caused by swap shortage.
The global LRU (kswapd) can swap out arbitrary pages. Swap-out means moving
an account from memory to swap; there is no change in the usage of memory+swap.
In other words, when we want to limit the usage of swap without affecting the
global LRU, a memory+swap limit is better than just limiting swap from an OS
point of view.
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and file
caches are dropped instead. But, as mentioned above, the global LRU can still
swap out memory from it for the sanity of the system's memory management state.
You can't forbid it by cgroup.
The reclaim algorithm has not been modified for cgroups, except that the
pages that are selected for reclaiming come from the per-cgroup LRU list.
Note2: When panic_on_oom is set to "2", the whole system will panic.
The lock order is:
	PG_locked
	mm->page_table_lock
	    zone->lru_lock
	        lock_page_cgroup

The per-zone-per-cgroup LRU (the cgroup's private LRU) is guarded only by
zone->lru_lock; it has no lock of its own.
With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default, but
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.
A per-memcg copy of each kmem_cache is created the first time the cache is
touched from inside the memcg. The creation is done lazily, so some objects
can still be skipped while the cache is being created. All objects in a slab
page should belong to the same memcg; this only fails to hold when a task is
migrated to a different memcg during the page allocation by the cache.
* sockets memory pressure: some socket protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.
When the kernel limit is set below the user limit (K < U), kernel memory is a
subset of the user memory. This setup is useful in deployments where the total
amount of memory per cgroup is overcommitted. Overcommitting kernel memory
limits is definitely not recommended, since the box can still run out of
non-reclaimable memory.
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory

NOTE: We can write "-1" to reset *.limit_in_bytes (unlimited).
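The same setup can also be done programmatically by writing the control files.
The sketch below is only an illustration; the /sys/fs/cgroup/memory/0 path and
the 4M value are example choices, not part of the interface:

  /* Minimal sketch: create a memory cgroup, move ourselves into it and set
   * a 4M hard limit. Paths and values are examples only. */
  #include <errno.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int write_str(const char *path, const char *val)
  {
      FILE *f = fopen(path, "w");

      if (!f)
          return -1;
      fputs(val, f);
      return fclose(f);
  }

  int main(void)
  {
      char pid[32];

      /* # mkdir /sys/fs/cgroup/memory/0 */
      if (mkdir("/sys/fs/cgroup/memory/0", 0755) && errno != EEXIST)
          return 1;

      /* # echo $$ > /sys/fs/cgroup/memory/0/tasks */
      snprintf(pid, sizeof(pid), "%d\n", getpid());
      write_str("/sys/fs/cgroup/memory/0/tasks", pid);

      /* # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes */
      return write_str("/sys/fs/cgroup/memory/0/memory.limit_in_bytes", "4M\n");
  }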
A successful write to a limit file does not guarantee that the limit was set
to the exact value written; the value can be rounded up to page boundaries or
capped by the availability of memory on the system. The user is required to
re-read the file after a write to see the value actually committed by the
kernel.
Performance testing is also important. To see the memory controller's pure
overhead, testing on tmpfs is useful.

Page-fault scalability is also important. When measuring parallel page faults,
a multi-process test may be better than a multi-thread test because the latter
adds noise from shared objects/status.

Running your usual tests with the memory controller enabled is always helpful.
A sync followed by "echo 1 > /proc/sys/vm/drop_caches" will help get rid of
some of the pages cached in the cgroup (page cache pages).

Because rmdir() moves all pages to the parent, some out-of-use page caches can
be moved to the parent. If you want to avoid that, force_empty will be useful.
# per-memory cgroup local status
cache		- # of bytes of page cache memory.
rss		- # of bytes of anonymous and swap cache memory (includes
		  transparent hugepages).
rss_huge	- # of bytes of anonymous transparent hugepages.
mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
pgpgin		- # of charging events to the memory cgroup. The charging
		  event happens each time a page is accounted as either a
		  mapped anon page (RSS) or a cache page (Page Cache) to
		  the cgroup.
pgpgout		- # of uncharging events to the memory cgroup. The uncharging
		  event happens each time a page is unaccounted from the
		  cgroup.
swap		- # of bytes of swap usage
dirty		- # of bytes that are waiting to get written back to the disk.
writeback	- # of bytes of file/anon cache that are queued for syncing to
		  disk.
inactive_anon	- # of bytes of anonymous and swap cache memory on inactive
		  LRU list.
active_anon	- # of bytes of anonymous and swap cache memory on active
		  LRU list.
inactive_file	- # of bytes of file-backed memory on inactive LRU list.
active_file	- # of bytes of file-backed memory on active LRU list.
unevictable	- # of bytes of memory that cannot be reclaimed (mlocked etc).

# status considering hierarchy (see memory.use_hierarchy settings)
hierarchical_memory_limit - # of bytes of memory limit with regard to the
		  hierarchy under which the memory cgroup is
hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
		  the hierarchy under which the memory cgroup is

total_<counter>	- # hierarchical version of <counter>, which in addition to
		  the cgroup's own value includes the sum of all hierarchical
		  children's values of <counter>, i.e. total_cache

# The following additional stats are dependent on CONFIG_DEBUG_VM.
recent_rotated_anon	- VM internal parameter. (see mm/vmscan.c)
recent_rotated_file	- VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon	- VM internal parameter. (see mm/vmscan.c)
recent_scanned_file	- VM internal parameter. (see mm/vmscan.c)
Only anonymous and swap cache memory is listed as part of the 'rss' stat.
This should not be confused with the true 'resident set size' or the amount
of physical memory used by the cgroup. 'rss + mapped_file' will give you the
resident set size of the cgroup.
(Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is the owner of the page
cache.)
usage_in_bytes is a fuzz value optimized for efficient access, not an exact
counter. If you want to know the more exact memory usage, you should use the
RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
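A small program can compute this figure directly from memory.stat. The sketch
below assumes an example cgroup path and simply sums the rss, cache and swap
fields; it is an illustration, not a complete tool:

  /* Minimal sketch: approximate usage as rss + cache (+ swap) from
   * memory.stat, as recommended above. Path is an example only. */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      FILE *f = fopen("/sys/fs/cgroup/memory/0/memory.stat", "r");
      char key[64];
      unsigned long long val, rss = 0, cache = 0, swap = 0;

      if (!f)
          return 1;
      while (fscanf(f, "%63s %llu", key, &val) == 2) {
          if (!strcmp(key, "rss"))
              rss = val;
          else if (!strcmp(key, "cache"))
              cache = val;
          else if (!strcmp(key, "swap"))
              swap = val;
      }
      fclose(f);
      printf("approx. usage: %llu bytes (rss+cache+swap)\n", rss + cache + swap);
      return 0;
  }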
This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the NUMA locality information within a
memcg, since the pages are allowed to be allocated from any physical node.
The output includes per-node page counts, including "hierarchical_<counter>",
which sums up all hierarchical children's values in addition to the memcg's
own value.
The memory controller supports a deep hierarchy and hierarchical accounting.
NOTE2: When panic_on_oom is set to "2", the whole system will panic in
case of an OOM.
When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make sure
that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but they do their best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup.
Note: Charges are moved only when you move mm->owner, in other words,
a leader of a thread group.
694 -----+------------------------------------------------------------------------
697 -----+------------------------------------------------------------------------
- All moving-charge operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex for too long, so we may need some tricks here.
To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

The application will be notified through eventfd when memory usage crosses
the threshold in either direction.

This is applicable to both root and non-root cgroups.
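The registration can be done from a small program. A minimal sketch follows;
the cgroup path /sys/fs/cgroup/memory/0 and the 4M threshold are only
examples, and error handling is kept to a minimum:

  /* Minimal sketch of the registration steps above: block until
   * memory.usage_in_bytes crosses a 4M threshold. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  int main(void)
  {
      int efd = eventfd(0, 0);
      int ufd = open("/sys/fs/cgroup/memory/0/memory.usage_in_bytes", O_RDONLY);
      int cfd = open("/sys/fs/cgroup/memory/0/cgroup.event_control", O_WRONLY);
      char cmd[64];
      uint64_t cnt;

      if (efd < 0 || ufd < 0 || cfd < 0)
          return 1;

      /* "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
      snprintf(cmd, sizeof(cmd), "%d %d %llu", efd, ufd, 4ULL << 20);
      write(cfd, cmd, strlen(cmd));

      /* Blocks until usage crosses the threshold in either direction. */
      read(efd, &cnt, sizeof(cnt));
      printf("usage crossed the 4M threshold\n");
      return 0;
  }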
To register a notifier, an application must:

 - create an eventfd using eventfd(2)
 - open memory.oom_control file
 - write string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

The application will be notified through eventfd when an OOM happens.
OOM notification doesn't work for the root cgroup.

You can disable the OOM-killer by writing "1" to the memory.oom_control file, as:

	# echo 1 > memory.oom_control
If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM waitqueue when they request accountable memory.
At reading, the file shows the current OOM status:
	oom_kill_disable 0 or 1
		(if 1, oom-killer is disabled)
	under_oom	 0 or 1
		(if 1, the memory cgroup is under OOM, tasks may be stopped.)
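Registering for OOM notifications follows the same eventfd pattern as the
threshold example above, except that the control string takes no threshold
argument. A minimal sketch, again with an example cgroup path:

  /* Minimal sketch: get notified on eventfd each time the cgroup enters OOM. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  int main(void)
  {
      int efd = eventfd(0, 0);
      int ofd = open("/sys/fs/cgroup/memory/0/memory.oom_control", O_RDONLY);
      int cfd = open("/sys/fs/cgroup/memory/0/cgroup.event_control", O_WRONLY);
      char cmd[32];
      uint64_t cnt;

      if (efd < 0 || ofd < 0 || cfd < 0)
          return 1;

      /* "<event_fd> <fd of memory.oom_control>" */
      snprintf(cmd, sizeof(cmd), "%d %d", efd, ofd);
      write(cfd, cmd, strlen(cmd));

      for (;;) {
          read(efd, &cnt, sizeof(cnt));   /* blocks until an OOM event */
          printf("memory cgroup hit OOM\n");
      }
  }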
774 The "low" level means that the system is reclaiming memory for new
776 maintaining cache level. Upon notification, the program (typically
780 The "medium" level means that the system is experiencing medium memory
781 pressure, the system might be making swap, paging out active file caches,
784 resources that can be easily reconstructed or re-read from a disk.
786 The "critical" level means that the system is actively thrashing, it is
787 about to out of memory (OOM) or even the in-kernel OOM killer is on its
789 system. It might be too late to consult with vmstat or any other
By default, events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, say you have three cgroups: A->B->C.
You set up an event listener on cgroups A, B and C, and suppose group C
experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B will receive the
notification only if there are no event listeners for group C.
803 - "default": this is the default behavior specified above. This mode is the
807 - "hierarchy": events always propagate up to the root, similar to the default
812 - "local": events are pass-through, i.e. they only receive notifications when
821 specified by a comma-delimited string, i.e. "low,hierarchy" specifies
822 hierarchical, pass-through, notification for all ancestor memcgs. Notification
823 that is the default, non pass-through behavior, does not specify a mode.
824 "medium,local" specifies pass-through notification for the medium level.
To register a notifier, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

The application will be notified through eventfd when memory pressure is at
the specified level (or higher).
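Again the same eventfd pattern applies; only the control string changes to
carry the level (and optional mode). A minimal sketch that listens for "low"
pressure on an example cgroup path:

  /* Minimal sketch: report each "low" level memory pressure event. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  int main(void)
  {
      int efd = eventfd(0, 0);
      int pfd = open("/sys/fs/cgroup/memory/0/memory.pressure_level", O_RDONLY);
      int cfd = open("/sys/fs/cgroup/memory/0/cgroup.event_control", O_WRONLY);
      char cmd[64];
      uint64_t cnt;

      if (efd < 0 || pfd < 0 || cfd < 0)
          return 1;

      /* "<event_fd> <fd of memory.pressure_level> <level[,mode]>" */
      snprintf(cmd, sizeof(cmd), "%d %d low", efd, pfd);
      write(cfd, cmd, strlen(cmd));

      for (;;) {
          read(efd, &cnt, sizeof(cnt));   /* blocks until pressure is signalled */
          printf("low memory pressure reported\n");
      }
  }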
If you then make the cgroup experience memory pressure (for example by
allocating past its limit), expect a bunch of notifications, and eventually
the oom-killer will trigger.
1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages
Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.
References

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
8. Singh, Balbir. RSS controller v2 test results (lmbench),
9. Singh, Balbir. RSS controller v2 AIM9 results
10. Singh, Balbir. Memory controller v6 test results,
11. Singh, Balbir. Memory controller introduction (v6),