===============================
Documentation for /proc/sys/vm/
===============================

kernel version 2.6.29

Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>

Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages         (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode


admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.


compact_memory
==============

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.

compaction_proactiveness
========================

This tunable takes a value in the range [0, 100] with a default value of
20. This tunable determines how aggressively compaction is done in the
background. Writing a non-zero value to this tunable immediately
triggers proactive compaction. Setting it to 0 disables proactive compaction.

Note that compaction has a non-trivial system-wide impact as pages
belonging to different processes are moved around, which could also lead
to latency spikes in unsuspecting applications. The kernel employs
various heuristics to avoid wasting CPU cycles if it detects that
proactive compaction is not being effective.

Be careful when setting it to extreme values like 100, as that may
cause excessive background compaction activity.

compact_unevictable_allowed
===========================

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.


dirty_background_bytes
======================

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note:
  dirty_background_bytes is the counterpart of dirty_background_ratio. Only
  one of them may be specified at a time. When one sysctl is written it is
  immediately taken into account to evaluate the dirty memory limits and the
  other appears as 0 when read.
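
The counterpart behaviour described in the note can be illustrated with a small toy model. This is only a sketch of the documented semantics, not kernel code: the class name, the method names, the ``available_bytes`` argument and the initial ratio of 10 are all assumptions made for the illustration.

```python
class DirtyBackgroundLimits:
    """Toy model of the dirty_background_bytes / dirty_background_ratio pair.

    Hypothetical helper for illustration only; it mirrors the documented
    rule that writing one sysctl makes the other read back as 0.
    """

    def __init__(self, ratio=10):
        # The starting ratio is an arbitrary value chosen for the example.
        self.ratio = ratio
        self.bytes = 0

    def write_bytes(self, value):
        # Writing the byte limit makes the ratio counterpart read as 0.
        self.bytes = value
        self.ratio = 0

    def write_ratio(self, value):
        # Writing the ratio makes the byte counterpart read as 0.
        self.ratio = value
        self.bytes = 0

    def threshold_bytes(self, available_bytes):
        # Whichever sysctl was written last determines the threshold.
        if self.bytes:
            return self.bytes
        return available_bytes * self.ratio // 100


limits = DirtyBackgroundLimits(ratio=10)
limits.write_bytes(64 * 1024 * 1024)
print(limits.ratio, limits.bytes)
```

The same pattern applies to the dirty_bytes / dirty_ratio pair.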


dirty_background_ratio
======================

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.


dirty_bytes
===========

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.


dirty_expire_centisecs
======================

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in hundredths
of a second.
Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.


dirty_ratio
===========

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.


dirtytime_expire_seconds
========================

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out. And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk. This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads. It is also used as the interval at which the dirtytime_writeback
thread is woken up.


dirty_writeback_centisecs
=========================

The kernel flusher threads will periodically wake up and write `old` data
out to disk.
This tunable expresses the interval between those wakeups, in
hundredths of a second.

Setting this to zero disables periodic writeback altogether.


drop_caches
===========

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes. Once dropped, their
memory becomes free.

To free pagecache::

	echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes)::

	echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache::

	echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc.). These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.
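
The values 1, 2 and 3 shown above combine as bits. The following sketch decodes a value into the caches it would drop; ``describe_drop_caches`` is a hypothetical helper written for this illustration and it does not touch /proc at all.

```python
def describe_drop_caches(value):
    """Return the cache types a drop_caches value asks the kernel to drop.

    Hypothetical helper: it only interprets the documented values
    (1 = pagecache, 2 = reclaimable slab objects, 3 = both).
    """
    targets = []
    if value & 1:
        # Bit 0: clean pagecache.
        targets.append("pagecache")
    if value & 2:
        # Bit 1: reclaimable slab objects such as dentries and inodes.
        targets.append("reclaimable slab objects")
    return targets


for value in (1, 2, 3):
    print(value, describe_drop_caches(value))
```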

Use of this file can cause performance problems. Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use. Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used::

	cat (1234): drop_caches: 3

These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.


extfrag_threshold
=================

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.
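
The decision described above can be sketched as a small function. This is a simplified model of the documented rule only; the function name and the returned labels are assumptions for the illustration, and the real kernel path involves further checks.

```python
def high_order_strategy(fragmentation_index, extfrag_threshold=500):
    """Classify a zone's fragmentation index per the documented rule.

    Hypothetical helper: returns a label, not a kernel decision.
    """
    if fragmentation_index == -1:
        # -1: the allocation should succeed as long as watermarks are met.
        return "allocation expected to succeed"
    if fragmentation_index <= extfrag_threshold:
        # At or below the threshold the kernel does not compact the zone;
        # the failure is attributed to lack of memory.
        return "reclaim"
    # Above the threshold the failure is attributed to fragmentation.
    return "compact"


print(high_order_strategy(250))
print(high_order_strategy(900))
```

Lowering extfrag_threshold in this model makes more index values fall into the "compact" branch.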


highmem_is_dirtyable
====================

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling. This is not the case by default, which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted, writers might be throttled too early and
streaming writes can get very slow.

Changing the value to non-zero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM killer invocation because some writers (e.g. direct block device writes)
can only use the low memory and they can fill it up with dirty data without
any throttling.


hugetlb_shm_group
=================

hugetlb_shm_group contains the group id that is allowed to create SysV
shared memory segments using hugetlb pages.


laptop_mode
===========

laptop_mode is a knob that controls "laptop mode".
All the things that are
controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.


legacy_va_layout
================

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.


lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see its values by reading this
file::

	% cat /proc/sys/vm/lowmem_reserve_ratio
	256     256     32

However, these values are not used directly. The kernel calculates the number
of protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo like the following (this is an example from
an x86-64 box). Each zone has an array of protection pages like this::

  Node 0, zone      DMA
    pages free     1355
          min      3
          low      3
          high     4
	:
	:
      numa_other   0
          protection: (0, 2004, 2004, 2004)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    pagesets
      cpu: 0 pcp: 0
          :

These protections are added to the score used to judge whether this zone
should be used for page allocation or should be reclaimed.
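
How the protection values in this section behave can be sketched in a few lines. This is a simplified model, not the kernel implementation: the function names are made up for the illustration, and the managed page counts are hypothetical numbers chosen so that the result reproduces the ``protection: (0, 2004, 2004, 2004)`` line from the zoneinfo example.

```python
def protection_pages(managed_pages, ratios, i, j):
    """Protection of zone[i] against allocations that could use zone[j].

    managed_pages: per-zone managed page counts on one node (hypothetical).
    ratios: the lowmem_reserve_ratio array.
    """
    if i >= j:
        # A zone is not protected from requests targeting itself or a
        # lower zone.
        return 0
    if ratios[i] < 1:
        # A ratio below 1 disables protection.
        return 0
    # Sum of managed pages of the higher zones i+1..j, scaled by 1/ratio.
    return sum(managed_pages[i + 1 : j + 1]) // ratios[i]


def zone_usable(free_pages, watermark, protection):
    """A zone qualifies only if its free pages exceed watermark + protection."""
    return free_pages > watermark + protection


# Hypothetical managed page counts for DMA, DMA32, Normal and Movable.
managed = [3976, 513024, 0, 0]
ratios = [256, 256, 32]

dma_protection = [protection_pages(managed, ratios, 0, j) for j in range(4)]
print(dma_protection)

# A request that could use the Normal zone (index 2), checked against the
# DMA zone with free=1355 and the high watermark of 4 from the example.
print(zone_usable(1355, 4, dma_protection[2]))
```

With a protection value of 0, as for a request that really needs the DMA zone, the same check passes.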

In this example, if normal pages (index=2) are requested from this DMA zone
and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that
this zone should not be used because pages_free(1355) is smaller than
watermark + protection[2] (4 + 2004 = 2008). If this protection value were 0,
this zone would be used for the normal page request. If the request is for the
DMA zone (index=0), protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression::

  (i < j):
    zone[i]->protection[j]
    = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
      / lowmem_reserve_ratio[i];
  (i = j):
     (should not be protected) = 0;
  (i > j):
     (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are

    === ====================================
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others)
    === ====================================

As the expression above shows, these values are the reciprocals of the ratio.
256 means 1/256. The number of protection pages becomes about "0.39%" of the
total managed pages of higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
A value less than 1 completely
disables protection of the pages.


max_map_count
=============

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65530.


memory_failure_early_kill
=========================

Control how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) that cannot be handled by the kernel
is detected in the background by hardware. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data it will kill to prevent any data
corruptions from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped
as soon as the corruption is detected.
Note this is not supported
for a few types of pages, like kernel internally allocated data or
the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.


memory_failure_recovery
=======================

Enable memory failure recovery (when supported by the platform).

1: Attempt recovery.

0: Always panic on a memory failure.


min_free_kbytes
===============

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
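
The proportional split can be sketched as follows. This is a simplified illustration under stated assumptions, not the kernel's exact computation: the function name, the 4 KiB page size and the zone sizes are all hypothetical.

```python
def per_zone_min_pages(min_free_kbytes, zone_managed_pages, page_size_kb=4):
    """Split the min_free_kbytes reserve across lowmem zones by size.

    zone_managed_pages: mapping of zone name to managed page count
    (hypothetical input). Returns reserved pages per zone.
    """
    # Convert the reserve from kilobytes to pages.
    min_pages = min_free_kbytes // page_size_kb
    total = sum(zone_managed_pages.values())
    if total == 0:
        return {name: 0 for name in zone_managed_pages}
    # Each zone receives a share proportional to its managed pages.
    return {
        name: min_pages * pages // total
        for name, pages in zone_managed_pages.items()
    }


# Hypothetical node with a small DMA zone and a larger Normal zone.
print(per_zone_min_pages(4096, {"DMA": 1000, "Normal": 3000}))
```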

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.


min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone. On Zone reclaim
(fallback from the local zone occurs) slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This ensures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.


min_unmapped_ratio
==================

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.


mmap_min_addr
=============

This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.


mmap_rnd_bits
=============

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization. This value will be bounded
by the architecture's minimum and maximum supported values.
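
As a rough illustration of what the number of bits means, the sketch below checks a requested value against architecture bounds and reports how many distinct offsets it allows. The function names and the bounds used in the example are hypothetical, and the assumption that the offset is applied in units of 4 KiB pages is not stated in this document.

```python
def check_rnd_bits(requested, arch_min, arch_max):
    """Return True if the requested bit count lies within the bounds.

    arch_min and arch_max stand in for the architecture's supported
    range; the values passed below are made up for the example.
    """
    return arch_min <= requested <= arch_max


def rnd_offset_count(bits):
    """Number of distinct random offsets that `bits` bits can select."""
    return 1 << bits


def rnd_span_bytes(bits, page_size=4096):
    """Address range covered by the offsets.

    Assumes the offset is page-granular (an assumption of this sketch).
    """
    return rnd_offset_count(bits) * page_size


bits = 28
if check_rnd_bits(bits, arch_min=28, arch_max=32):
    print(rnd_offset_count(bits), rnd_span_bytes(bits))
```

Each additional bit doubles the number of possible offsets.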

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.


mmap_rnd_compat_bits
====================

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization. This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.


nr_hugepages
============

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst


hugetlb_optimize_vmemmap
========================

This knob is not available when the size of 'struct page' (a structure defined
in include/linux/mm_types.h) is not a power of two (an unusual system config
could result in this).

Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).

Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool
to the buddy allocator, the vmemmap pages representing that range need to be
remapped again and the vmemmap pages discarded earlier need to be reallocated.
If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
never explicitly allocating HugeTLB pages with 'nr_hugepages' but only setting
'nr_overcommit_hugepages', those overcommitted HugeTLB pages are allocated 'on
the fly') instead of being pulled from the HugeTLB pool, you should weigh the
benefits of memory savings against the additional overhead (~2x slower than
before) of allocating or freeing HugeTLB pages between the HugeTLB pool and the
buddy allocator. Another behavior to note is that if the system is under heavy
memory pressure, it could prevent the user from freeing HugeTLB pages from the
HugeTLB pool to the buddy allocator since the allocation of vmemmap pages could
fail; you have to retry later if your system encounters this situation.
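
As a rough illustration of the savings quoted above (7 vmemmap pages per 2MB
HugeTLB page, 4095 per 1GB HugeTLB page), the following shell sketch estimates
the memory returned to the buddy allocator for a given pool size. It assumes a
4KB base page size, which the text above does not state, and the function name
is made up for this example::

    # Hypothetical helper, not part of any kernel tool.
    # $1 = number of HugeTLB pages
    # $2 = vmemmap pages saved per HugeTLB page (7 for 2MB, 4095 for 1GB)
    # Assumes a 4KB base page size; prints the saving in KB.
    hvo_saved_kb() {
        echo $(( $1 * $2 * 4 ))
    }

    hvo_saved_kb 1024 7     # 1024 x 2MB pages -> 28672
    hvo_saved_kb 16 4095    # 16 x 1GB pages   -> 262080

Such an estimate only covers the memory side of the trade-off; the extra
allocation and freeing cost described above still has to be measured on the
workload in question.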

Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will not be optimized meaning the extra overhead at allocation
time from buddy allocator disappears, whereas already optimized HugeTLB pages
will not be affected. If you want to make sure there are no optimized HugeTLB
pages, you can set "nr_hugepages" to 0 first and then disable this. Note that
writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
pages. So, those surplus pages are still optimized until they are no longer
in use. You would need to wait for those surplus pages to be released before
there are no optimized pages in the system.


nr_hugepages_mempolicy
======================

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_overcommit_hugepages
=======================

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
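
The sum above can be read from the two sysctl files. A minimal sketch, in
which the helper name is hypothetical and the files are assumed to exist only
on kernels built with HugeTLB support::

    # Hypothetical helper: maximum hugepage pool size, in pages.
    # $1 = nr_hugepages, $2 = nr_overcommit_hugepages
    max_hugepage_pool() {
        echo $(( $1 + $2 ))
    }

    # Only query the live values when both files are readable.
    if [ -r /proc/sys/vm/nr_hugepages ] &&
       [ -r /proc/sys/vm/nr_overcommit_hugepages ]; then
        max_hugepage_pool "$(cat /proc/sys/vm/nr_hugepages)" \
                          "$(cat /proc/sys/vm/nr_overcommit_hugepages)"
    fi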

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_trim_pages
=============

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/admin-guide/mm/nommu-mmap.rst for more information.


numa_zonelist_order
===================

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

Where the memory is allocated from is controlled by zonelists.

(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 to keep the explanation
simple; you may be able to read ZONE_DMA as ZONE_DMA32.)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows:
ZONE_NORMAL -> ZONE_DMA
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following 2 types of order.
Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::

  (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
  (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.

Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL exhaustion. This increases the possibility of
out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.

Type(B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type(A) is called "Node" order. Type(B) is called "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone Order" orders the zonelists by zone type, then by node within each
zone. Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.


oom_dump_tasks
==============

Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM-killing and includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
score, and name. This is helpful to determine why the OOM killer was
invoked, to identify the rogue task that caused it, and to determine why
the OOM killer chose the task it did to kill.

If this is set to zero, this information is suppressed. On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one. Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 1 (enabled).


oom_kill_allocating_task
========================

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill. This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.


overcommit_kbytes
=================

When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).


overcommit_memory
=================

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/mm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.


overcommit_ratio
================

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM. See above.


page-cluster
============

page-cluster controls the number of pages up to which consecutive pages
are read in from swap in a single attempt. This is the swap counterpart
to page cache readahead.
The mentioned consecutivity is not in terms of virtual/physical addresses,
but consecutive on swap space - that means they were swapped out together.

It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
Zero disables swap readahead completely.

The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

Lower values mean lower latencies for initial faults, but at the same time
extra faults and I/O delays for following faults if they would have been part
of the run of consecutive pages that readahead would have brought in.


page_lock_unfairness
====================

This value determines the number of times that the page lock can be
stolen from under a waiter. After the lock is stolen the number of times
specified in this file (default is 5), the "fair lock handoff" semantics
will apply, and the waiter will only be awakened if the lock can be taken.


panic_on_oom
============

This enables or disables the panic on out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process by invoking
the OOM killer (oom_killer). Usually, oom_killer can kill rogue processes
and the system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits the nodes it uses via mempolicy/cpusets,
and those nodes run out of memory, one process may be killed by the
OOM killer. No panic occurs in this case, because other nodes' memory
may still be free, which means the system as a whole may not be in a
fatal state yet.

If this is set to 2, the kernel panics compulsorily even in the
above-mentioned case. Even if an OOM happens under a memory cgroup, the
whole system panics.

The default value is 0.

1 and 2 are for failover of clustering. Please select either
according to your policy of failover.

panic_on_oom=2+kdump gives you a very strong tool to investigate
why an OOM happens. You can get a snapshot.


percpu_pagelist_high_fraction
=============================

This is the fraction of pages in each zone that can be stored on
per-cpu page lists. It is an upper boundary that is divided depending
on the number of online CPUs. The min value for this is 8 which means
that we do not allow more than 1/8th of pages in each zone to be stored
on per-cpu page lists. This entry only changes the value of hot per-cpu
page lists.
A user can specify a number like 100 to allocate 1/100th of
each zone between per-cpu lists.

The batch value of each per-cpu page list remains the same regardless of
the value of the high fraction so allocation latencies are unaffected.

The initial value is zero. The kernel uses this value to set the pcp->high
mark based on the low watermark for the zone and the number of local
online CPUs. If the user writes '0' to this sysctl, it will revert to
this default behavior.


stat_interval
=============

The time interval between which vm statistics are updated. The default
is 1 second.


stat_refresh
============

Any read or write (by root only) flushes all the per-cpu vm statistics
into their global totals, for more accurate reports when testing
e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo

As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)


numa_stat
=========

This interface allows runtime configuration of numa statistics.

When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do::

    echo 0 > /proc/sys/vm/numa_stat

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do::

    echo 1 > /proc/sys/vm/numa_stat


swappiness
==========

This control is used to define the rough relative IO cost of swapping
and filesystem paging, as a value between 0 and 200. At 100, the VM
assumes equal IO cost and will thus apply memory pressure to the page
cache and swap-backed pages equally; lower values signify more
expensive swap IO, higher values indicate cheaper swap IO.

Keep in mind that filesystem IO patterns under memory pressure tend to
be more efficient than swap's random IO. An optimal value will require
experimentation and will also be workload-dependent.

The default value is 60.
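
One way to read the 0 to 200 scale is as two weights that sum to 200, one
for swap-backed pages and one for the page cache, split in proportion to
how cheap each kind of IO is. The shell sketch below follows that reading.
The helper name and the idea of passing a single integer speed ratio are
assumptions made for illustration, not a kernel interface::

    # Hypothetical helper.
    # $1 = how many times faster swap IO is than filesystem IO (integer).
    # Solves x + ratio * x = 200 and prints ratio * x, rounded down.
    swappiness_for_ratio() {
        echo $(( 200 * $1 / ($1 + 1) ))
    }

    swappiness_for_ratio 1    # equal IO cost     -> 100
    swappiness_for_ratio 2    # swap 2x as fast   -> 133

A value obtained this way is only a starting point; as noted above, the
optimal setting is workload-dependent and needs experimentation.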

For in-memory swap, like zram or zswap, as well as hybrid setups that
have swap on faster devices than the filesystem, values beyond 100 can
be considered. For example, if the random IO against the swap device
is on average 2x faster than IO from the filesystem, swappiness should
be 133 (x + 2x = 200, 2x = 133.33).

At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone.


unprivileged_userfaultfd
========================

This flag controls the mode in which unprivileged users can use the
userfaultfd system calls. Set this to 0 to restrict unprivileged users
to handle page faults in user mode only. In this case, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
succeed. Prohibiting use of userfaultfd for handling faults from kernel
mode may make certain vulnerabilities more difficult to exploit.

Set this to 1 to allow unprivileged users to use the userfaultfd system
calls without any restrictions.

The default value is 0.

Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.
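
The two modes can be checked from a shell before relying on userfaultfd in
an unprivileged program. This is a minimal sketch; the helper name and the
wording of its messages are made up for this example::

    # Hypothetical helper: describe a vm.unprivileged_userfaultfd value.
    describe_uffd_mode() {
        case "$1" in
            0) echo "unprivileged users: user-mode faults only" ;;
            1) echo "unprivileged users: unrestricted" ;;
            *) echo "unknown value: $1" ;;
        esac
    }

    # Report the live setting when the file is present and readable.
    f=/proc/sys/vm/unprivileged_userfaultfd
    if [ -r "$f" ]; then
        describe_uffd_mode "$(cat "$f")"
    fi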


user_reserve_kbytes
===================

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.


vfs_cache_pressure
==================

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches.
When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.


watermark_boost_factor
======================

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.

To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
15,000 means that up to 150% of the high watermark will be reclaimed in the
event of a pageblock being mixed due to fragmentation.
The level of reclaim
is determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblock's
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.


watermark_scale_factor
======================

This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 3000, or 30% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.


zone_reclaim_mode
=================

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory.
If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This value is OR'ed together from the following bits:

= ===================================
1 Zone reclaim on
2 Zone reclaim writes dirty pages out
4 Zone reclaim swaps pages
= ===================================

zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

Consider enabling one or more zone_reclaim mode bits if it's known that the
workload is partitioned such that each partition fits within a NUMA node
and that accessing remote memory would cause a measurable performance
reduction. The page allocator will take additional actions before
allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process.
This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
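
Because zone_reclaim_mode is a bitmask built from the values 1, 2 and 4
listed above, a setting such as 3 combines "zone reclaim on" with "writes
dirty pages out". The following shell sketch decodes a value into its bits;
the helper name and the output strings are hypothetical::

    # Hypothetical helper: list the bits set in a zone_reclaim_mode value.
    decode_zone_reclaim_mode() {
        mode=$1
        if [ "$mode" -eq 0 ]; then
            echo "zone reclaim off"
            return 0
        fi
        [ $(( mode & 1 )) -ne 0 ] && echo "zone reclaim on"
        [ $(( mode & 2 )) -ne 0 ] && echo "writes dirty pages out"
        [ $(( mode & 4 )) -ne 0 ] && echo "swaps pages"
        return 0
    }

    decode_zone_reclaim_mode 3

The live value can be passed in the same way, for example with
decode_zone_reclaim_mode "$(cat /proc/sys/vm/zone_reclaim_mode)" on a
kernel that provides the file.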