===============================
Documentation for /proc/sys/vm/
===============================

kernel version 2.6.29

Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>

Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- compact_memory
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- extra_free_kbytes
- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode


admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB).

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of the programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum the resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.
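
As a rough illustration of the calculation above, the reserve for the
default overcommit 'guess' mode can be estimated by summing the RSS of the
recovery tools. This is only a sketch: the process names, the use of ps(1)
and awk(1), and the resulting value are examples, not a recommendation::

    # Sum the resident set sizes (kB) of sshd, a shell and top.
    rss_kb=$(ps -o rss= -C sshd,bash,top | awk '{ sum += $1 } END { print sum }')
    echo "estimated reserve: ${rss_kb} kB"

    # Apply it (as root); the kernel default is min(3% of free pages, 8MB).
    echo "${rss_kb}" > /proc/sys/vm/admin_reserve_kbytes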


block_dump
==========

block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


compact_memory
==============

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.


compact_unevictable_allowed
===========================

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.


dirty_background_bytes
======================

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note:
  dirty_background_bytes is the counterpart of dirty_background_ratio. Only
  one of them may be specified at a time. When one sysctl is written it is
  immediately taken into account to evaluate the dirty memory limits and the
  other appears as 0 when read.


dirty_background_ratio
======================

Contains the percentage of total available memory (free pages plus
reclaimable pages) at which the background kernel flusher threads will
start writing out dirty data.

The total available memory is not equal to total system memory.


dirty_bytes
===========

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.


dirty_expire_centisecs
======================

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in hundredths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out the next time a flusher thread wakes up.


dirty_ratio
===========

Contains the percentage of total available memory (free pages plus
reclaimable pages) at which a process which is generating disk writes will
itself start writing out dirty data.

The total available memory is not equal to total system memory.
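
Because the byte and ratio forms of these limits are mutually exclusive,
writing one zeroes the other. A minimal sketch of that behaviour, using
example values only::

    echo $((512 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
    cat /proc/sys/vm/dirty_ratio          # now reads 0
    echo 20 > /proc/sys/vm/dirty_ratio
    cat /proc/sys/vm/dirty_bytes          # now reads 0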


dirtytime_expire_seconds
========================

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out. And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk. This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads. It is also used as the interval at which the dirtytime_writeback
thread wakes up.


dirty_writeback_centisecs
=========================

The kernel flusher threads will periodically wake up and write 'old' data
out to disk. This tunable expresses the interval between those wakeups, in
hundredths of a second.

Setting this to zero disables periodic writeback altogether.


drop_caches
===========

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes. Once dropped, their
memory becomes free.

To free pagecache::

    echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes)::

    echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache::

    echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc...). These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems. Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use. Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used::

    cat (1234): drop_caches: 3

These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.


extfrag_threshold
=================

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.
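
For example, the per-zone indices can be inspected and the threshold tuned
as sketched below. This assumes debugfs is mounted at /sys/kernel/debug;
the value 400 is purely an example::

    cat /sys/kernel/debug/extfrag/extfrag_index

    # A lower threshold makes the kernel more willing to compact rather
    # than reclaim; a higher one biases the decision toward reclaim.
    echo 400 > /proc/sys/vm/extfrag_threshold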


highmem_is_dirtyable
====================

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling. This is not the case by default, which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted, writers might be throttled too early and
streaming writes can get very slow.

Changing the value to non-zero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM killer invocation because some writers (e.g. direct block device writes)
can only use the low memory and they can fill it up with dirty data without
any throttling.


extra_free_kbytes
=================

This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.

This is useful for workloads that require low-latency memory allocations
and have a bounded burstiness in memory allocations. For example, a
realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.


hugetlb_shm_group
=================

hugetlb_shm_group contains the group ID that is allowed to create SysV
shared memory segments using hugetlb pages.


laptop_mode
===========

laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.


legacy_va_layout
================

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.


lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap, then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see it by reading this file::

    % cat /proc/sys/vm/lowmem_reserve_ratio
    256     256     32

But these values are not used directly. The kernel calculates the number of
protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo, as in the following example from an
x86-64 box. Each zone has an array of protection pages like this::

    Node 0, zone      DMA
      pages free     1355
            min      3
            low      3
            high     4
        :
        :
        numa_other   0
            protection: (0, 2004, 2004, 2004)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      pagesets
        cpu: 0 pcp: 0
            :

These protections are added to the watermark when judging whether this zone
should be used for page allocation or should be reclaimed instead.

In this example, if normal pages (index=2) are required from this DMA zone
and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that
this zone should not be used because pages_free (1355) is smaller than
watermark + protection[2] (4 + 2004 = 2008). If this protection value were 0,
this zone could be used for a normal page request. If the request is for the
DMA zone itself (index=0), protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression::

    (i < j):
      zone[i]->protection[j]
        = (total managed_pages of zone[i+1] through zone[j] on the node)
          / lowmem_reserve_ratio[i]
    (i = j):
      zone[i]->protection[j] = 0  (a zone does not need protection from itself)
    (i > j):
      zone[i]->protection[j] = 0  (not needed; reported as 0)

The default values of lowmem_reserve_ratio[i] are

    === ====================================
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others)
    === ====================================

As the expression shows, these values are the reciprocals of the ratios:
256 means 1/256, so the number of protection pages becomes about 0.39% of
the total managed pages of the higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
disables protection of the pages.
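
As a worked example of the expression above (a sketch only; the page count
is an assumption chosen to match the /proc/zoneinfo excerpt shown earlier):
with lowmem_reserve_ratio[DMA] = 256 and roughly 513024 managed pages in the
zones above ZONE_DMA on that node, the DMA zone's protection against normal
page requests comes out as in the example::

    cat /proc/sys/vm/lowmem_reserve_ratio     # e.g. 256  256  32
    echo $((513024 / 256))                    # -> 2004 protection pages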


max_map_count
=============

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.


memory_failure_early_kill
=========================

Controls how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) that cannot be handled by the kernel is
detected in the background by hardware. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data, it will kill processes to prevent any
data corruption from propagating.

1: Kill all processes that have the corrupted and not-reloadable page mapped
as soon as the corruption is detected. Note this is not supported
for a few types of pages, like kernel internally allocated data or
the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL
prctl.


memory_failure_recovery
=======================

Enable memory failure recovery (when supported by the platform).

1: Attempt recovery.

0: Always panic on a memory failure.


min_free_kbytes
===============

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages proportional
to its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.
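
The per-zone "min" watermarks derived from this value can be read back from
/proc/zoneinfo; summed over all zones and converted from pages to kilobytes,
they roughly correspond to min_free_kbytes. A small sketch of that
cross-check (the awk invocation is an example, not part of the kernel
interface)::

    page_kb=$(( $(getconf PAGE_SIZE) / 1024 ))
    awk -v pk="$page_kb" '$1 == "min" { sum += $2 }
                          END { print sum * pk " kB" }' /proc/zoneinfo
    cat /proc/sys/vm/min_free_kbytes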


min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone. On zone reclaim
(i.e. when fallback from the local zone occurs), slab pages will be reclaimed
if more than this percentage of pages in a zone are reclaimable slab pages.
This ensures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per-zone / per-node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.


min_unmapped_ratio
==================

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'ed in, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.


mmap_min_addr
=============

This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.


mmap_rnd_bits
=============

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization. This value will be bounded
by the architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.


mmap_rnd_compat_bits
====================

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization. This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.


nr_hugepages
============

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_hugepages_mempolicy
======================

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_overcommit_hugepages
=======================

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst
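
A minimal example of resizing the persistent pool and checking the result
(128 is an arbitrary example; the kernel may allocate fewer pages than
requested if enough contiguous memory is not available)::

    echo 128 > /proc/sys/vm/nr_hugepages
    grep HugePages_Total /proc/meminfo
    cat /proc/sys/vm/nr_hugepages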


nr_trim_pages
=============

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/nommu-mmap.txt for more information.


numa_zonelist_order
===================

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

Where the memory is allocated from is controlled by zonelists.

(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for a simple explanation;
you may be able to read ZONE_DMA as ZONE_DMA32.)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows::

    ZONE_NORMAL -> ZONE_DMA

This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following two types of order.
Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::

    (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
    (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA

Type (A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL exhaustion. This increases the possibility of
out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.

Type (B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type (A) is called "Node" order. Type (B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone order" orders the zonelists by zone type, then by node within each
zone. Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.


oom_dump_tasks
==============

Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM-killing and includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
score, and name. This is helpful to determine why the OOM killer was
invoked, to identify the rogue task that caused it, and to determine why
the OOM killer chose the task it did to kill.

If this is set to zero, this information is suppressed. On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one. Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 1 (enabled).
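
For example, on a system with a very large number of tasks the dump can be
suppressed (and later re-enabled) at run time; this is only an illustration
of the 0/1 semantics described above::

    echo 0 > /proc/sys/vm/oom_dump_tasks    # suppress the task dump
    echo 1 > /proc/sys/vm/oom_dump_tasks    # restore the default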


oom_kill_allocating_task
========================

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill. This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.


overcommit_kbytes
=================

When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).


overcommit_memory
=================

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/vm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.


overcommit_ratio
================

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM. See above.
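
For mode 2, the resulting commit limit and the current commitment are
exported in /proc/meminfo as CommitLimit and Committed_AS, which makes it
easy to check the effect of a change. The values below are examples only::

    echo 2 > /proc/sys/vm/overcommit_memory
    echo 80 > /proc/sys/vm/overcommit_ratio      # limit = swap + 80% of RAM
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo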


page-cluster
============

page-cluster controls the number of pages up to which consecutive pages
are read in from swap in a single attempt. This is the swap counterpart
to page cache readahead.
The mentioned consecutivity is not in terms of virtual/physical addresses,
but consecutive on swap space - that means they were swapped out together.

It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
Zero disables swap readahead completely.

The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

Lower values mean lower latencies for the initial fault, but at the same time
extra faults and I/O delays for subsequent faults that would have been
satisfied by the readahead of those consecutive pages.


panic_on_oom
============

This enables or disables the panic-on-out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process via the
OOM killer. Usually, the OOM killer can kill a rogue process and the
system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits its allocations to certain nodes using
mempolicy/cpusets, and those nodes reach memory exhaustion, one process
may be killed by the OOM killer. No panic occurs in this case, because
other nodes' memory may be free and the system as a whole may not yet be
in a fatal state.

If this is set to 2, the kernel panics unconditionally even in the
above-mentioned case. Even if the OOM happens under a memory cgroup,
the whole system panics.

The default value is 0.

Values 1 and 2 are for failover in clustered environments. Please select
one according to your failover policy.

panic_on_oom=2 combined with kdump gives you a very strong tool to
investigate why the OOM happens, since you can obtain a memory snapshot.


percpu_pagelist_fraction
========================

This is the fraction of pages in each zone, at most, that are allocated for
each per-cpu page list (the high mark, pcp->high). The minimum value for this
is 8, which means that we don't allow more than 1/8th of the pages in each
zone to be allocated in any single per_cpu_pagelist. This entry only changes
the value of hot per-cpu pagelists. The user can specify a number like 100 to
allocate 1/100th of each zone to each per-cpu page list.

The batch value of each per-cpu pagelist is also updated as a result. It is
set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).

The initial value is zero. The kernel does not use this value at boot time to
set the high water marks for each per-cpu page list. If the user writes '0'
to this sysctl, it will revert to this default behavior.


stat_interval
=============

The time interval between which vm statistics are updated. The default
is 1 second.


stat_refresh
============

Any read or write (by root only) flushes all the per-cpu vm statistics
into their global totals, for more accurate reports when testing, e.g.::

    cat /proc/sys/vm/stat_refresh /proc/meminfo

As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)


numa_stat
=========

This interface allows runtime configuration of numa statistics.

When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do::

    echo 0 > /proc/sys/vm/numa_stat

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do::

    echo 1 > /proc/sys/vm/numa_stat


swappiness
==========

This control is used to define how aggressively the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.

The default value is 60.


unprivileged_userfaultfd
========================

This flag controls whether unprivileged users can use the userfaultfd
system calls. Set this to 1 to allow unprivileged users to use the
userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
privileged users (with the CAP_SYS_PTRACE capability).

The default value is 1.


user_reserve_kbytes
===================

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory-hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.
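
For instance, on a machine deliberately dedicated to one large job, the
reserve can be dropped so that a single process may use all free memory
minus admin_reserve_kbytes, with the loss of recoverability described
above::

    cat /proc/sys/vm/user_reserve_kbytes
    echo 0 > /proc/sys/vm/user_reserve_kbytes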


vfs_cache_pressure
==================

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.


watermark_boost_factor
======================

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and that
the success rate of future high-order allocations, such as SLUB allocations,
THP and hugetlbfs pages, increases.

To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
watermark will be reclaimed in the event of a pageblock being mixed due
to fragmentation. The level of reclaim is determined by the number of
fragmentation events that occurred in the recent past. If this value is
smaller than a pageblock, then a full pageblock's worth of pages will be
reclaimed (e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the
feature.


watermark_scale_factor
======================

This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 1000, or 10% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.
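
A possible tuning workflow, with 100 chosen purely as an example value
(it widens the watermark distance to 1% of node memory)::

    # Check for direct reclaim stalls and premature kswapd sleeps.
    grep -E 'allocstall|kswapd_low_wmark_hit_quickly' /proc/vmstat

    # If these counters grow quickly, let kswapd keep more pages free.
    echo 100 > /proc/sys/vm/watermark_scale_factor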


zone_reclaim_mode
=================

zone_reclaim_mode allows one to set more or less aggressive approaches to
reclaiming memory when a zone runs out of memory. If it is set to zero then
no zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

The value is a bitmask, OR'ed together from the following:

= ===================================
1 Zone reclaim on
2 Zone reclaim writes dirty pages out
4 Zone reclaim swaps pages
= ===================================

zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off-node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up, effectively
throttling the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore, but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
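
For example, to enable zone reclaim together with writeback of dirty pages
during reclaim, the first two flags from the table above can be combined
(example values; pick the flags that suit the workload)::

    echo $((1 | 2)) > /proc/sys/vm/zone_reclaim_mode    # writes 3
    cat /proc/sys/vm/zone_reclaim_mode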