.. _cgroup-v2:

================
Control Group v2
================

:Date: October, 2015
:Author: Tejun Heo <tj@kernel.org>

This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.

.. CONTENTS

   1. Introduction
     1-1. Terminology
     1-2. What is cgroup?
   2. Basic Operations
     2-1. Mounting
     2-2. Organizing Processes and Threads
       2-2-1. Processes
       2-2-2. Threads
     2-3. [Un]populated Notification
     2-4. Controlling Controllers
       2-4-1. Enabling and Disabling
       2-4-2. Top-down Constraint
       2-4-3. No Internal Process Constraint
     2-5. Delegation
       2-5-1. Model of Delegation
       2-5-2. Delegation Containment
     2-6. Guidelines
       2-6-1. Organize Once and Control
       2-6-2. Avoid Name Collisions
   3. Resource Distribution Models
     3-1. Weights
     3-2. Limits
     3-3. Protections
     3-4. Allocations
   4. Interface Files
     4-1. Format
     4-2. Conventions
     4-3. Core Interface Files
   5. Controllers
     5-1. CPU
       5-1-1. CPU Interface Files
     5-2. Memory
       5-2-1. Memory Interface Files
       5-2-2. Usage Guidelines
       5-2-3. Memory Ownership
     5-3. IO
       5-3-1. IO Interface Files
       5-3-2. Writeback
       5-3-3. IO Latency
         5-3-3-1. How IO Latency Throttling Works
         5-3-3-2. IO Latency Interface Files
       5-3-4. IO Priority
     5-4. PID
       5-4-1. PID Interface Files
     5-5. Cpuset
       5-5-1. Cpuset Interface Files
     5-6. Device
     5-7. RDMA
       5-7-1. RDMA Interface Files
     5-8. HugeTLB
       5-8-1. HugeTLB Interface Files
     5-9. Misc
       5-9-1. Miscellaneous cgroup Interface Files
       5-9-2. Migration and Ownership
     5-10. Others
       5-10-1. perf_event
     5-N. Non-normative information
       5-N-1. CPU controller root cgroup process behaviour
       5-N-2. IO controller root cgroup process behaviour
   6. Namespace
     6-1. Basics
     6-2. The Root and Views
     6-3. Migration and setns(2)
     6-4. Interaction with Other Namespaces
   P. Information on Kernel Programming
     P-1. Filesystem Support for Writeback
   D. Deprecated v1 Core Features
   R. Issues with v1 and Rationales for v2
     R-1. Multiple Hierarchies
     R-2. Thread Granularity
     R-3. Competition Between Inner Nodes and Threads
     R-4. Other Interface Issues
     R-5. Controller Issues and Remedies
       R-5-1. Memory


Introduction
============

Terminology
-----------

"cgroup" stands for "control group" and is never capitalized. The
singular form is used to designate the whole feature and also as a
qualifier as in "cgroup controllers". When explicitly referring to
multiple individual control groups, the plural form "cgroups" is used.


What is cgroup?
---------------

cgroup is a mechanism to organize processes hierarchically and
distribute system resources along the hierarchy in a controlled and
configurable manner.

cgroup is largely composed of two parts - the core and controllers.
cgroup core is primarily responsible for hierarchically organizing
processes.
A cgroup controller is usually responsible for
distributing a specific type of system resource along the hierarchy
although there are utility controllers which serve purposes other than
resource distribution.

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup. All threads of a process belong to the
same cgroup. On creation, all processes are put in the cgroup that
the parent process belongs to at the time. A process can be migrated
to another cgroup. Migration of a process doesn't affect already
existing descendant processes.

Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup. All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups comprising the inclusive
sub-hierarchy of the cgroup. When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further. The
restrictions set closer to the root in the hierarchy cannot be
overridden from further away.


Basic Operations
================

Mounting
--------

Unlike v1, cgroup v2 has only a single hierarchy.
The cgroup v2 hierarchy can be mounted with the following mount
command::

  # mount -t cgroup2 none $MOUNT_POINT

cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All
controllers which support v2 and are not bound to a v1 hierarchy are
automatically bound to the v2 hierarchy and show up at the root.
Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies. This allows mixing v2 hierarchy with the
legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the v2 hierarchy after the final umount of the previous hierarchy.
Similarly, a controller should be fully disabled to be moved out of
the unified hierarchy and it may take some time for the disabled
controller to become available for other hierarchies; furthermore, due
to inter-controller dependencies, other controllers may need to be
disabled too.

While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use.
It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers after system boot.

During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible. To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and makes them always available in v2.

cgroup v2 currently supports the following mount options.

  nsdelegate
        Consider cgroup namespaces as delegation boundaries. This
        option is system wide and can only be set on mount or modified
        through remount from the init namespace. The mount option is
        ignored on non-init namespace mounts. Please refer to the
        Delegation section for details.

  favordynmods
        Reduce the latencies of dynamic cgroup modifications such as
        task migrations and controller on/offs at the cost of making
        hot path operations such as forks and exits more expensive.
        The static usage pattern of creating a cgroup, enabling
        controllers, and then seeding it with CLONE_INTO_CGROUP is
        not affected by this option.

  memory_localevents
        Only populate memory.events with data for the current cgroup,
        and not any subtrees. This is legacy behaviour; the default
        behaviour without this option is to include subtree counts.
        This option is system wide and can only be set on mount or
        modified through remount from the init namespace. The mount
        option is ignored on non-init namespace mounts.

  memory_recursiveprot
        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups. This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees. This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).


Organizing Processes and Threads
--------------------------------

Processes
~~~~~~~~~

Initially, only the root cgroup exists to which all processes belong.
A child cgroup can be created by creating a sub-directory::

  # mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree
structure. Each cgroup has a read-writable interface file
"cgroup.procs". When read, it lists the PIDs of all processes which
belong to the cgroup one-per-line. The PIDs are not ordered and the
same PID may show up more than once if the process got moved to
another cgroup and then back or the PID got recycled while reading.

A process can be migrated into a cgroup by writing its PID to the
target cgroup's "cgroup.procs" file. Only one process can be migrated
on a single write(2) call. If a process is composed of multiple
threads, writing the PID of any thread migrates all threads of the
process.

When a process forks a child process, the new process is born into the
cgroup that the forking process belongs to at the time of the
operation. After exit, a process stays associated with the cgroup
that it belonged to at the time of exit until it's reaped; however, a
zombie process does not appear in "cgroup.procs" and thus can't be
moved to another cgroup.

A cgroup which doesn't have any children or live processes can be
destroyed by removing the directory.
Note that a cgroup which doesn't
have any children and is associated only with zombie processes is
considered empty and can be removed::

  # rmdir $CGROUP_NAME

"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
cgroup is in use in the system, this file may contain multiple lines,
one for each hierarchy. The entry for cgroup v2 is always in the
format "0::$PATH"::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested

If the process becomes a zombie and the cgroup it was associated with
is removed subsequently, " (deleted)" is appended to the path::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested (deleted)


Threads
~~~~~~~

cgroup v2 supports thread granularity for a subset of controllers to
support use cases requiring hierarchical resource distribution across
the threads of a group of processes. By default, all threads of a
process belong to the same cgroup, which also serves as the resource
domain to host resource consumptions which are not specific to a
process or thread. The thread mode allows threads to be spread across
a subtree while still maintaining the common resource domain for them.

Controllers which support thread mode are called threaded controllers.
The ones which don't are called domain controllers.

Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup. The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy. The root
of a threaded subtree, that is, the nearest ancestor which is not
threaded, is called the threaded domain or thread root interchangeably
and serves as the resource domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in
different cgroups and are not subject to the no internal process
constraint - threaded controllers can be enabled on non-leaf cgroups
whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have populated child cgroups which aren't threaded. Because the
root cgroup is not subject to the no internal process constraint, it
can serve both as a threaded domain and a parent to domain cgroups.

The current operation mode or type of the cgroup is shown in the
"cgroup.type" file which indicates whether the cgroup is a normal
domain, a domain which is serving as the domain of a threaded subtree,
or a threaded cgroup.

On creation, a cgroup is always a domain cgroup and can be made
threaded by writing "threaded" to the "cgroup.type" file. The
operation is one-way::

  # echo threaded > cgroup.type

Once threaded, the cgroup can't be made a domain again. To enable the
thread mode, the following conditions must be met.

- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.

- When the parent is an unthreaded domain, it must not have any domain
  controllers enabled or populated domain children. The root is
  exempt from this requirement.

Topology-wise, a cgroup can be in an invalid state. Please consider
the following topology::

  A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can
host child domains. C can't be used until it is turned into a
threaded cgroup. The "cgroup.type" file will report "domain (invalid)"
in these cases. Operations which fail due to invalid topology use
EOPNOTSUPP as the errno.

A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.
A threaded domain reverts to a normal domain when the conditions
clear.

When read, "cgroup.threads" contains the list of the thread IDs of all
threads in the cgroup. Except that the operations are per-thread
instead of per-process, "cgroup.threads" has the same format and
behaves the same way as "cgroup.procs". While "cgroup.threads" can be
written to in any cgroup, as it can only move threads inside the same
threaded domain, its operations are confined inside each threaded
subtree.

The threaded domain cgroup serves as the resource domain for the whole
subtree, and, while the threads can be scattered across the subtree,
all the processes are considered to be in the threaded domain cgroup.
"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
processes in the subtree and is not readable in the subtree proper.
However, "cgroup.procs" can be written to from anywhere in the subtree
to migrate all threads of the matching process to the cgroup.

Only threaded controllers can be enabled in a threaded subtree. When
a threaded controller is enabled inside a threaded subtree, it only
accounts for and controls resource consumptions associated with the
threads in the cgroup and its descendants. All consumptions which
aren't tied to a specific thread belong to the threaded domain cgroup.

Because a threaded subtree is exempt from the no internal process
constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.


[Un]populated Notification
--------------------------

Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. This can be used, for
example, to start a clean-up operation after all processes of a given
sub-hierarchy have exited. The populated state updates and
notifications are recursive.
Consider the following sub-hierarchy
where the numbers in the parentheses represent the numbers of processes
in each cgroup::

  A(4) - B(0) - C(1)
              \ D(0)

A, B and C's "populated" fields would be 1 while D's would be 0. After
the one process in C exits, B and C's "populated" fields would flip to
"0" and file modified events will be generated on the "cgroup.events"
files of both cgroups.


Controlling Controllers
-----------------------

Enabling and Disabling
~~~~~~~~~~~~~~~~~~~~~~

Each cgroup has a "cgroup.controllers" file which lists all
controllers available for the cgroup to enable::

  # cat cgroup.controllers
  cpu io memory

No controller is enabled by default. Controllers can be enabled and
disabled by writing to the "cgroup.subtree_control" file::

  # echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be
enabled. When multiple operations are specified as above, they either
all succeed or all fail. If multiple operations on the same controller
are specified, the last one is effective.
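
The write semantics described above (all-or-nothing, last operation on
a controller wins, only listed controllers can be enabled) can be
pictured with a small userspace model. This is only an illustrative
sketch, not kernel code: the function name ``resolve_subtree_control``
and the use of ``ValueError`` are assumptions made for the example.

```python
def resolve_subtree_control(request, available):
    """Model how one write to "cgroup.subtree_control" is resolved.

    request:   string such as "+cpu +memory -io"
    available: controller names as listed in "cgroup.controllers"
    Returns {controller: True to enable / False to disable}.
    Any invalid operation raises ValueError before anything is
    returned, which models the all-succeed-or-all-fail behaviour.
    (Hypothetical helper; the real checks live in the kernel.)
    """
    available = set(available)
    ops = {}
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError(f"malformed operation: {token!r}")
        if sign == "+" and name not in available:
            raise ValueError(f"controller not available: {name!r}")
        # A later operation on the same controller replaces an earlier one.
        ops[name] = (sign == "+")
    return ops


result = resolve_subtree_control("+cpu +memory -io", ["cpu", "io", "memory"])
```

With the request from the example above, ``result`` maps "cpu" and
"memory" to enable and "io" to disable; a request such as "+cpu -cpu"
resolves to disabling "cpu" only.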

Enabling a controller in a cgroup indicates that the distribution of
the target resource across its immediate children will be controlled.
Consider the following sub-hierarchy. The enabled controllers are
listed in parentheses::

  A(cpu,memory) - B(memory) - C()
                            \ D()

As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B. As B has
"memory" enabled but not "cpu", C and D will compete freely on CPU
cycles but their division of memory available to B will be controlled.

As a controller regulates the distribution of the target resource to
the cgroup's children, enabling it creates the controller's interface
files in the child cgroups. In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D. Likewise, disabling "memory" from B would remove the "memory."
prefixed controller interface files from C and D. This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.
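
The relationship between a parent's "cgroup.subtree_control" and the
interface files that appear in its children can be modelled in a few
lines. The sketch below is a toy model of the A/B/C/D example, with
the dictionaries ``parent_of`` and ``subtree_control`` and the helper
name chosen purely for illustration.

```python
def controller_files_by_cgroup(parent_of, subtree_control):
    """Return {cgroup: sorted controller prefixes of its interface files}.

    parent_of:       {cgroup: parent name, or None if the parent lies
                      outside the modelled sub-hierarchy}
    subtree_control: {cgroup: set of controllers enabled in its
                      "cgroup.subtree_control"}
    A cgroup holds "<controller>." files for exactly the controllers
    enabled in its parent. Cgroups whose parent is unknown are skipped.
    (Hypothetical model for illustration only.)
    """
    result = {}
    for cgroup, parent in parent_of.items():
        if parent is None:
            continue
        result[cgroup] = sorted(subtree_control.get(parent, set()))
    return result


parent_of = {"A": None, "B": "A", "C": "B", "D": "B"}
subtree_control = {"A": {"cpu", "memory"}, "B": {"memory"},
                   "C": set(), "D": set()}
files = controller_files_by_cgroup(parent_of, subtree_control)
```

In this model B holds both "cpu." and "memory." prefixed files while C
and D hold only "memory." prefixed ones; adding "cpu" to B's set makes
the "cpu." files appear in C and D as well.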


Top-down Constraint
~~~~~~~~~~~~~~~~~~~

Resources are distributed top-down and a cgroup can further distribute
a resource only if the resource has been distributed to it from the
parent. This means that all non-root "cgroup.subtree_control" files
can only contain controllers which are enabled in the parent's
"cgroup.subtree_control" file. A controller can be enabled only if
the parent has the controller enabled and a controller can't be
disabled if one or more children have it enabled.


No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Non-root cgroups can distribute domain resources to their children
only when they don't have any processes of their own. In other words,
only domain cgroups which don't contain any processes can have domain
controllers enabled in their "cgroup.subtree_control" files.

This guarantees that, when a domain controller is looking at the part
of the hierarchy which has it enabled, processes are always only on
the leaves. This rules out situations where child cgroups compete
against internal processes of the parent.

The root cgroup is exempt from this restriction.
Root contains
processes and anonymous resource consumption which can't be associated
with any other cgroups and requires special treatment from most
controllers. How resource consumption in the root cgroup is governed
is up to each controller (for more information on this topic please
refer to the Non-normative information section in the Controllers
chapter).

Note that the restriction doesn't get in the way if there is no
enabled controller in the cgroup's "cgroup.subtree_control". This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its processes to the
children before enabling controllers in its "cgroup.subtree_control"
file.


Delegation
----------

Model of Delegation
~~~~~~~~~~~~~~~~~~~

A cgroup can be delegated in two ways. First, to a less privileged
user by granting the user write access to the directory and its
"cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" files.
Second, if the "nsdelegate" mount option is set, automatically to a
cgroup namespace on namespace creation.
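
For the first method, one possible way of granting the write access is
to change the ownership of the directory and the three files to the
delegatee. The sketch below assumes that approach and assumes the
files' existing mode bits already give the owner write permission; the
helper name ``delegate_to_user`` is hypothetical. On a real cgroup2
mount this would normally have to be run by a privileged user.

```python
import os

# The files named in the delegation model above. Resource control
# interface files are deliberately left untouched.
DELEGATED_FILES = ("cgroup.procs", "cgroup.threads", "cgroup.subtree_control")


def delegate_to_user(cgroup_dir, uid, gid):
    """Give uid:gid ownership of a cgroup directory and its delegation files.

    Hypothetical helper: ownership change is one assumed way of
    granting write access, not the only one.
    """
    os.chown(cgroup_dir, uid, gid)
    for name in DELEGATED_FILES:
        os.chown(os.path.join(cgroup_dir, name), uid, gid)
```

A call such as ``delegate_to_user("/sys/fs/cgroup/some-cgroup", 1000, 1000)``
would be typical, where the path and the IDs are placeholders.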

Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.

The end results are equivalent for both delegation types. Once
delegated, the user can build a sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent. The limits and other settings
of all resource controllers are hierarchical and regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may be limited explicitly in the future.


Delegation Containment
~~~~~~~~~~~~~~~~~~~~~~

A delegated sub-hierarchy is contained in the sense that processes
can't be moved into or out of the sub-hierarchy by the delegatee.

For delegations to a less privileged user, this is achieved by
requiring the following conditions for a process with a non-root euid
to migrate a target process into a cgroup by writing its PID to the
"cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file of the
  common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy it can't pull
in from or push out to outside the sub-hierarchy.

For example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs".
U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" file and thus the write
will be denied with -EACCES.

For delegations to namespaces, containment is achieved by requiring
that both the source and destination cgroups are reachable from the
namespace of the process which is attempting the migration. If either
is not reachable, the migration is rejected with -ENOENT.


Guidelines
----------

Organize Once and Control
~~~~~~~~~~~~~~~~~~~~~~~~~

Migrating a process across cgroups is a relatively expensive operation
and stateful resources such as memory are not moved together with the
process. This is an explicit design decision as there often exist
inherent trade-offs between migration and various hot paths in terms
of synchronization cost.

As such, migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged. A workload
should be assigned to a cgroup according to the system's logical and
resource structure once on start-up.
Dynamic adjustments to resource
distribution can be made by changing controller configuration through
the interface files.


Avoid Name Collisions
~~~~~~~~~~~~~~~~~~~~~

Interface files for a cgroup and its children cgroups occupy the same
directory and it is possible to create children cgroups which collide
with interface files.

All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot. A controller's name is composed of lowercase letters and
'_'s but never begins with an '_' so it can be used as the prefix
character for collision avoidance. Also, interface file names won't
start or end with terms which are often used in categorizing workloads
such as job, service, slice, unit or workload.

cgroup doesn't do anything to prevent name collisions and it's the
user's responsibility to avoid them.


Resource Distribution Models
============================

cgroup controllers implement several resource distribution schemes
depending on the resource type and expected use cases. This section
describes major schemes in use along with their expected behaviors.


Weights
-------

A parent's resource is distributed by adding up the weights of all
active children and giving each the fraction matching the ratio of its
weight against the sum. As only children which can make use of the
resource at the moment participate in the distribution, this is
work-conserving. Due to the dynamic nature, this model is usually
used for stateless resources.

All weights are in the range [1, 10000] with the default at 100. This
allows symmetric multiplicative biases in both directions at fine
enough granularity while staying in the intuitive range.

As long as the weight is in range, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"cpu.weight" proportionally distributes CPU cycles to active children
and is an example of this type.


.. _cgroupv2-limits-distributor:

Limits
------

A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.

Limits are in the range [0, max] and default to "max", which is noop.

As limits can be over-committed, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.

.. _cgroupv2-protections-distributor:

Protections
-----------

A cgroup is protected up to the configured amount of the resource
as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
only up to the amount available to the parent is protected among
children.

Protections are in the range [0, max] and default to 0, which is
noop.

As protections can be over-committed, all configuration combinations
are valid and there is no reason to reject configuration changes or
process migrations.

"memory.low" implements best-effort memory protection and is an
example of this type.


Allocations
-----------

A cgroup is exclusively allocated a certain amount of a finite
resource. Allocations can't be over-committed - the sum of the
allocations of children cannot exceed the amount of resource
available to the parent.

Allocations are in the range [0, max] and default to 0, which is no
resource.

As allocations can't be over-committed, some configuration
combinations are invalid and should be rejected. Also, if the
resource is mandatory for execution of processes, process migrations
may be rejected.

"cpu.rt.max" hard-allocates realtime slices and is an example of this
type.


Interface Files
===============

Format
------

All interface files should be in one of the following formats whenever
possible::

  New-line separated values
  (when only one value can be written at once)

        VAL0\n
        VAL1\n
        ...

  Space separated values
  (when read-only or multiple values can be written at once)

        VAL0 VAL1 ...\n

  Flat keyed

        KEY0 VAL0\n
        KEY1 VAL1\n
        ...

  Nested keyed

        KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
        KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
        ...

For a writable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time. For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.


Conventions
-----------

- Settings for a single feature should be contained in a single file.

- The root cgroup should be exempt from resource control and thus
  shouldn't have resource control interface files.

- The default time unit is microseconds. If a different unit is ever
  used, an explicit unit suffix must be present.

- A parts-per quantity should use a percentage decimal with at least
  two digit fractional part - e.g. 13.40.

- If a controller implements weight based resource distribution, its
  interface file should be named "weight" and have the range [1,
  10000] with 100 as the default. The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the interface files should be named "min" and "max"
  respectively. If a controller implements best effort resource
  guarantee and/or limit, the interface files should be named "low"
  and "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has a configurable default value and keyed specific
  overrides, the default entry should be keyed with "default" and
  appear as the first entry in the file.

  The default value can be updated by writing either "default $VAL" or
  "$VAL".

  When writing to update a specific override, "default" can be used as
  the value to indicate removal of the override.
  Override entries
  with "default" as the value must not appear when read.

  For example, a setting which is keyed by major:minor device numbers
  with integer values may look like the following::

    # cat cgroup-example-interface-file
    default 150
    8:0 300

  The default value can be updated by::

    # echo 125 > cgroup-example-interface-file

  or::

    # echo "default 125" > cgroup-example-interface-file

  An override can be set by::

    # echo "8:16 170" > cgroup-example-interface-file

  and cleared by::

    # echo "8:0 default" > cgroup-example-interface-file
    # cat cgroup-example-interface-file
    default 125
    8:16 170

- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
  Whenever a notifiable event happens, a file modified event should be
  generated on the file.


Core Interface Files
--------------------

All cgroup core files are prefixed with "cgroup."
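
Before the individual files are described, it may help to see how the
flat keyed and nested keyed formats from the Format section could be
read from userland. The following is a minimal sketch; the function
names are hypothetical, and values are kept as strings because tokens
such as "max" or "default" can appear where numbers are expected.

```python
def parse_flat_keyed(text):
    """Parse "KEY VAL" lines into a dict mapping key to value string."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The key is everything up to the first space; the rest is the value.
        key, _, value = line.partition(" ")
        result[key] = value.strip()
    return result


def parse_nested_keyed(text):
    """Parse "KEY SUB=VAL SUB=VAL" lines into a dict of dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, pairs = fields[0], fields[1:]
        # Sub key pairs may come in any order, so a dict is a natural fit.
        result[key] = dict(pair.split("=", 1) for pair in pairs)
    return result
```

Applied to the example file contents shown in the Conventions section,
``parse_flat_keyed("default 125\n8:16 170\n")`` yields a dict with the
keys "default" and "8:16".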

  cgroup.type
        A read-write single value file which exists on non-root
        cgroups.

        When read, it indicates the current type of the cgroup, which
        can be one of the following values.

        - "domain" : A normal valid domain cgroup.

        - "domain threaded" : A threaded domain cgroup which is
          serving as the root of a threaded subtree.

        - "domain invalid" : A cgroup which is in an invalid state.
          It can't be populated or have controllers enabled. It may
          be allowed to become a threaded cgroup.

        - "threaded" : A threaded cgroup which is a member of a
          threaded subtree.

        A cgroup can be turned into a threaded cgroup by writing
        "threaded" to this file.

  cgroup.procs
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the PIDs of all processes which belong to
        the cgroup one-per-line. The PIDs are not ordered and the
        same PID may show up more than once if the process got moved
        to another cgroup and then back or the PID got recycled while
        reading.

        A PID can be written to migrate the process associated with
        the PID to the cgroup.
        The writer should match all of the
        following conditions.

        - It must have write access to the "cgroup.procs" file.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

        In a threaded cgroup, reading this file fails with EOPNOTSUPP
        as all the processes belong to the thread root. Writing is
        supported and moves every thread of the process to the cgroup.

  cgroup.threads
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the TIDs of all threads which belong to
        the cgroup one-per-line. The TIDs are not ordered and the
        same TID may show up more than once if the thread got moved to
        another cgroup and then back or the TID got recycled while
        reading.

        A TID can be written to migrate the thread associated with the
        TID to the cgroup. The writer should match all of the
        following conditions.

        - It must have write access to the "cgroup.threads" file.

        - The cgroup that the thread is currently in must be in the
          same resource domain as the destination cgroup.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

  cgroup.controllers
        A read-only space separated values file which exists on all
        cgroups.

        It shows a space separated list of all controllers available to
        the cgroup. The controllers are not ordered.

  cgroup.subtree_control
        A read-write space separated values file which exists on all
        cgroups. Starts out empty.

        When read, it shows a space separated list of the controllers
        which are enabled to control resource distribution from the
        cgroup to its children.

        A space separated list of controllers prefixed with '+' or '-'
        can be written to enable or disable controllers. A controller
        name prefixed with '+' enables the controller and '-'
        disables. If a controller appears more than once on the list,
        the last one is effective. When multiple enable and disable
        operations are specified, either all succeed or all fail.
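
The write semantics of "cgroup.subtree_control" described above (last
occurrence wins, all-or-nothing) can be modelled in userland as
follows. This is a simplified sketch, not the kernel's implementation:
the function name is hypothetical, and the only failure condition
modelled is a malformed token or a controller missing from the
``available`` set, whereas the kernel can reject a write for other
reasons as well.

```python
def apply_subtree_control(enabled, available, request):
    """Return the enabled set after applying a "+name -name ..." request.

    Raises ValueError without modifying `enabled` if any token is
    malformed or names a controller that is not in `available`.
    """
    ops = {}
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError("malformed token: %r" % token)
        name = token[1:]
        if name not in available:
            raise ValueError("controller not available: %r" % name)
        # A later occurrence of the same controller overrides earlier ones.
        ops[name] = token[0] == "+"

    # Only reached when every token validated, so the update is atomic.
    new_enabled = set(enabled)
    for name, enable in ops.items():
        if enable:
            new_enabled.add(name)
        else:
            new_enabled.discard(name)
    return new_enabled
```

Validating every token before touching the set is what gives the
all-or-nothing behavior in this model.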

  cgroup.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined. Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          populated
                1 if the cgroup or its descendants contain any live
                processes; otherwise, 0.
          frozen
                1 if the cgroup is frozen; otherwise, 0.

  cgroup.max.descendants
        A read-write single value file. The default is "max".

        Maximum allowed number of descendant cgroups.
        If the actual number of descendants is equal or larger,
        an attempt to create a new cgroup in the hierarchy will fail.

  cgroup.max.depth
        A read-write single value file. The default is "max".

        Maximum allowed descent depth below the current cgroup.
        If the actual descent depth is equal or larger,
        an attempt to create a new child cgroup will fail.

  cgroup.stat
        A read-only flat-keyed file with the following entries:

          nr_descendants
                Total number of visible descendant cgroups.

          nr_dying_descendants
                Total number of dying descendant cgroups.
                A cgroup becomes
                dying after being deleted by a user. The cgroup will remain
                in dying state for some undefined time (which can depend
                on system load) before being completely destroyed.

                A process can't enter a dying cgroup under any circumstances,
                and a dying cgroup can't revive.

                A dying cgroup can consume system resources not exceeding
                limits, which were active at the moment of cgroup deletion.

  cgroup.freeze
        A read-write single value file which exists on non-root cgroups.
        Allowed values are "0" and "1". The default is "0".

        Writing "1" to the file causes freezing of the cgroup and all
        descendant cgroups. This means that all belonging processes will
        be stopped and will not run until the cgroup is explicitly
        unfrozen. Freezing of the cgroup may take some time; when this action
        is completed, the "frozen" value in the cgroup.events control file
        will be updated to "1" and the corresponding notification will be
        issued.

        A cgroup can be frozen either by its own settings, or by settings
        of any ancestor cgroups. If any of the ancestor cgroups is frozen, the
        cgroup will remain frozen.

        Processes in the frozen cgroup can be killed by a fatal signal.
        They also can enter and leave a frozen cgroup: either by an explicit
        move by a user, or if freezing of the cgroup races with fork().
        If a process is moved to a frozen cgroup, it stops. If a process is
        moved out of a frozen cgroup, it becomes running.

        Frozen status of a cgroup doesn't affect any cgroup tree operations:
        it's possible to delete a frozen (and empty) cgroup, as well as
        create new sub-cgroups.

  cgroup.kill
        A write-only single value file which exists in non-root cgroups.
        The only allowed value is "1".

        Writing "1" to the file causes the cgroup and all descendant cgroups to
        be killed. This means that all processes located in the affected cgroup
        tree will be killed via SIGKILL.

        Killing a cgroup tree will deal with concurrent forks appropriately and
        is protected against migrations.

        In a threaded cgroup, writing this file fails with EOPNOTSUPP as
        killing cgroups is a process directed operation, i.e. it affects
        the whole thread-group.

  cgroup.pressure
        A read-write single value file. Allowed values are "0" and "1".
        The default is "1".

        Writing "0" to the file will disable the cgroup PSI accounting.
        Writing "1" to the file will re-enable the cgroup PSI accounting.

        This control attribute is not hierarchical, so disabling or enabling PSI
        accounting in a cgroup does not affect PSI accounting in descendants
        and doesn't need to pass enablement via ancestors from root.

        The reason this control attribute exists is that PSI accounts stalls for
        each cgroup separately and aggregates it at each level of the hierarchy.
        This may cause non-negligible overhead for some workloads when under
        deep level of the hierarchy, in which case this control attribute can
        be used to disable PSI accounting in the non-leaf cgroups.

  irq.pressure
        A read-write nested-keyed file.

        Shows pressure stall information for IRQ/SOFTIRQ. See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.

Controllers
===========

.. _cgroup-v2-cpu:

CPU
---

The "cpu" controller regulates distribution of CPU cycles. This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

In all the above models, cycles distribution is defined only on a temporal
base and it does not account for the frequency at which tasks are executed.
The (optional) utilization clamping support allows hinting the schedutil
cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.

WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
have placed RT processes into non-root cgroups during the system boot
process, and these processes may need to be moved to the root cgroup
before the cpu controller can be enabled.


CPU Interface Files
~~~~~~~~~~~~~~~~~~~

All time durations are in microseconds.

  cpu.stat
        A read-only flat-keyed file.
        This file exists whether the controller is enabled or not.

        It always reports the following three stats:

        - usage_usec
        - user_usec
        - system_usec

        and the following five when the controller is enabled:

        - nr_periods
        - nr_throttled
        - throttled_usec
        - nr_bursts
        - burst_usec

  cpu.weight
        A read-write single value file which exists on non-root
        cgroups. The default is "100".

        The weight in the range [1, 10000].

  cpu.weight.nice
        A read-write single value file which exists on non-root
        cgroups. The default is "0".

        The nice value is in the range [-20, 19].

        This interface file is an alternative interface for
        "cpu.weight" and allows reading and setting weight using the
        same values used by nice(2). Because the range is smaller and
        granularity is coarser for the nice values, the read value is
        the closest approximation of the current weight.

  cpu.max
        A read-write two value file which exists on non-root cgroups.
        The default is "max 100000".

        The maximum bandwidth limit.
        It's in the following format::

          $MAX $PERIOD

        which indicates that the group may consume up to $MAX in each
        $PERIOD duration.  "max" for $MAX indicates no limit.  If only
        one number is written, $MAX is updated.

  cpu.max.burst
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        The burst is in the range [0, $MAX].

  cpu.pressure
        A read-write nested-keyed file.

        Shows pressure stall information for CPU. See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.

  cpu.uclamp.min
        A read-write single value file which exists on non-root cgroups.
        The default is "0", i.e. no utilization boosting.

        The requested minimum utilization (protection) as a percentage
        rational number, e.g. 12.34 for 12.34%.

        This interface allows reading and setting minimum utilization clamp
        values, similar to sched_setattr(2). This minimum utilization
        value is used to clamp the task specific minimum utilization clamp.

        The requested minimum utilization (protection) is always capped by
        the current value for the maximum utilization (limit), i.e.
        `cpu.uclamp.max`.

  cpu.uclamp.max
        A read-write single value file which exists on non-root cgroups.
        The default is "max", i.e. no utilization capping.

        The requested maximum utilization (limit) as a percentage rational
        number, e.g. 98.76 for 98.76%.

        This interface allows reading and setting maximum utilization clamp
        values, similar to sched_setattr(2). This maximum utilization
        value is used to clamp the task specific maximum utilization clamp.



Memory
------

The "memory" controller regulates distribution of memory.  Memory is
stateful and implements both limit and protection models.  Due to the
intertwining between memory usage and reclaim pressure and the
stateful nature of memory, the distribution model is relatively
complex.

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be
accounted and controlled to a reasonable extent.  Currently, the
following types of memory usages are tracked.

- Userland memory - page cache and anonymous memory.

- Kernel data structures such as dentries and inodes.

- TCP socket buffers.

The above list may expand in the future for better coverage.


Memory Interface Files
~~~~~~~~~~~~~~~~~~~~~~

All memory amounts are in bytes.  If a value which is not aligned to
PAGE_SIZE is written, the value may be rounded up to the closest
PAGE_SIZE multiple when read back.

  memory.current
        A read-only single value file which exists on non-root
        cgroups.

        The total amount of memory currently being used by the cgroup
        and its descendants.

  memory.min
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        Hard memory protection.  If the memory usage of a cgroup
        is within its effective min boundary, the cgroup's memory
        won't be reclaimed under any conditions. If there is no
        unprotected reclaimable memory available, the OOM killer
        is invoked. Above the effective min boundary (or
        effective low boundary if it is higher), pages are reclaimed
        proportionally to the overage, reducing reclaim pressure for
        smaller overages.

        Effective min boundary is limited by memory.min values of
        all ancestor cgroups.
        If there is memory.min overcommitment
        (child cgroup or cgroups are requiring more protected memory
        than parent will allow), then each child cgroup will get
        the part of parent's protection proportional to its
        actual memory usage below memory.min.

        Putting more memory than generally available under this
        protection is discouraged and may lead to constant OOMs.

        If a memory cgroup is not populated with processes,
        its memory.min is ignored.

  memory.low
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        Best-effort memory protection.  If the memory usage of a
        cgroup is within its effective low boundary, the cgroup's
        memory won't be reclaimed unless there is no reclaimable
        memory available in unprotected cgroups.
        Above the effective low boundary (or
        effective min boundary if it is higher), pages are reclaimed
        proportionally to the overage, reducing reclaim pressure for
        smaller overages.

        Effective low boundary is limited by memory.low values of
        all ancestor cgroups.
        If there is memory.low overcommitment
        (child cgroup or cgroups are requiring more protected memory
        than parent will allow), then each child cgroup will get
        the part of parent's protection proportional to its
        actual memory usage below memory.low.

        Putting more memory than generally available under this
        protection is discouraged.

  memory.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage throttle limit.  If a cgroup's usage goes
        over the high boundary, the processes of the cgroup are
        throttled and put under heavy reclaim pressure.

        Going over the high limit never invokes the OOM killer and
        under extreme conditions the limit may be breached. The high
        limit should be used in scenarios where an external process
        monitors the limited cgroup to alleviate heavy reclaim
        pressure.

  memory.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage hard limit.  This is the main mechanism to limit
        memory usage of a cgroup.  If a cgroup's memory usage reaches
        this limit and can't be reduced, the OOM killer is invoked in
        the cgroup.
        Under certain circumstances, the usage may go
        over the limit temporarily.

        In the default configuration, regular 0-order allocations always
        succeed unless the OOM killer chooses the current task as a victim.

        Some kinds of allocations don't invoke the OOM killer.
        The caller could retry them differently, return -ENOMEM to
        userspace, or silently ignore the failure in cases like disk
        readahead.

  memory.reclaim
        A write-only nested-keyed file which exists for all cgroups.

        This is a simple interface to trigger memory reclaim in the
        target cgroup.

        This file accepts a single key, the number of bytes to reclaim.
        No nested keys are currently supported.

        Example::

          echo "1G" > memory.reclaim

        The interface can be later extended with nested keys to
        configure the reclaim behavior. For example, specify the
        type of memory to reclaim from (anon, file, ..).

        Please note that the kernel can over or under reclaim from
        the target cgroup. If fewer bytes are reclaimed than the
        specified amount, -EAGAIN is returned.

        Please note that the proactive reclaim (triggered by this
        interface) is not meant to indicate memory pressure on the
        memory cgroup. Therefore socket memory balancing triggered by
        the memory reclaim normally is not exercised in this case.
        This means that the networking layer will not adapt based on
        reclaim induced by memory.reclaim.

  memory.peak
        A read-only single value file which exists on non-root
        cgroups.

        The max memory usage recorded for the cgroup and its
        descendants since the creation of the cgroup.

  memory.oom.group
        A read-write single value file which exists on non-root
        cgroups.  The default value is "0".

        Determines whether the cgroup should be treated as
        an indivisible workload by the OOM killer. If set,
        all tasks belonging to the cgroup or to its descendants
        (if the memory cgroup is not a leaf cgroup) are killed
        together or not at all. This can be used to avoid
        partial kills to guarantee workload integrity.

        Tasks with the OOM protection (oom_score_adj set to -1000)
        are treated as an exception and are never killed.

        If the OOM killer is invoked in a cgroup, it's not going
        to kill any tasks outside of this cgroup, regardless of the
        memory.oom.group values of ancestor cgroups.

  memory.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

        Note that all fields in this file are hierarchical and the
        file modified event can be generated due to an event down the
        hierarchy. For the local events at the cgroup level see
        memory.events.local.

          low
                The number of times the cgroup is reclaimed due to
                high memory pressure even though its usage is under
                the low boundary.  This usually indicates that the low
                boundary is over-committed.

          high
                The number of times processes of the cgroup are
                throttled and routed to perform direct memory reclaim
                because the high memory boundary was exceeded.  For a
                cgroup whose memory usage is capped by the high limit
                rather than global memory pressure, this event's
                occurrences are expected.

          max
                The number of times the cgroup's memory usage was
                about to go over the max boundary.  If direct reclaim
                fails to bring it down, the cgroup goes to OOM state.

          oom
                The number of times the cgroup's memory usage
                reached the limit and allocation was about to fail.

                This event is not raised if the OOM killer is not
                considered as an option, e.g. for failed high-order
                allocations or if the caller asked not to retry.

          oom_kill
                The number of processes belonging to this cgroup
                killed by any kind of OOM killer.

          oom_group_kill
                The number of times a group OOM has occurred.

  memory.events.local
        Similar to memory.events but the fields in the file are local
        to the cgroup i.e. not hierarchical. The file modified event
        generated on this file reflects only the local events.

  memory.stat
        A read-only flat-keyed file which exists on non-root cgroups.

        This breaks down the cgroup's memory footprint into different
        types of memory, type-specific details, and other information
        on the state and past events of the memory management system.

        All memory amounts are in bytes.

        The entries are ordered to be human readable, and new entries
        can show up in the middle. Don't rely on items remaining in a
        fixed position; use the keys to look up specific values!

        If an entry has no per-node counter (i.e. it does not show up
        in memory.numa_stat), it is tagged with 'npn' (non-per-node)
        below.

          anon
                Amount of memory used in anonymous mappings such as
                brk(), sbrk(), and mmap(MAP_ANONYMOUS)

          file
                Amount of memory used to cache filesystem data,
                including tmpfs and shared memory.

          kernel (npn)
                Amount of total kernel memory, including
                (kernel_stack, pagetables, percpu, vmalloc, slab) in
                addition to other kernel memory use cases.

          kernel_stack
                Amount of memory allocated to kernel stacks.

          pagetables
                Amount of memory allocated for page tables.

          sec_pagetables
                Amount of memory allocated for secondary page tables;
                this currently includes KVM mmu allocations on x86
                and arm64.

          percpu (npn)
                Amount of memory used for storing per-cpu kernel
                data structures.

          sock (npn)
                Amount of memory used in network transmission buffers

          vmalloc (npn)
                Amount of memory used for vmap backed memory.

          shmem
                Amount of cached filesystem data that is swap-backed,
                such as tmpfs, shm segments, shared anonymous mmap()s

          zswap
                Amount of memory consumed by the zswap compression backend.

          zswapped
                Amount of application memory swapped out to zswap.

          file_mapped
                Amount of cached filesystem data mapped with mmap()

          file_dirty
                Amount of cached filesystem data that was modified but
                not yet written back to disk

          file_writeback
                Amount of cached filesystem data that was modified and
                is currently being written back to disk

          swapcached
                Amount of swap cached in memory. The swapcache is accounted
                against both memory and swap usage.

          anon_thp
                Amount of memory used in anonymous mappings backed by
                transparent hugepages

          file_thp
                Amount of cached filesystem data backed by transparent
                hugepages

          shmem_thp
                Amount of shm, tmpfs, shared anonymous mmap()s backed by
                transparent hugepages

          inactive_anon, active_anon, inactive_file, active_file, unevictable
                Amount of memory, swap-backed and filesystem-backed,
                on the internal memory management lists used by the
                page reclaim algorithm.

                As these represent internal list state (e.g. shmem pages are on
                anon memory management lists), inactive_foo + active_foo may not
                be equal to the value for the foo counter, since the foo counter
                is type-based, not list-based.

          slab_reclaimable
                Part of "slab" that might be reclaimed, such as
                dentries and inodes.

          slab_unreclaimable
                Part of "slab" that cannot be reclaimed on memory
                pressure.

          slab (npn)
                Amount of memory used for storing in-kernel data
                structures.

          workingset_refault_anon
                Number of refaults of previously evicted anonymous pages.

          workingset_refault_file
                Number of refaults of previously evicted file pages.

          workingset_activate_anon
                Number of refaulted anonymous pages that were immediately
                activated.

          workingset_activate_file
                Number of refaulted file pages that were immediately activated.

          workingset_restore_anon
                Number of restored anonymous pages which have been detected as
                an active workingset before they got reclaimed.

          workingset_restore_file
                Number of restored file pages which have been detected as an
                active workingset before they got reclaimed.

          workingset_nodereclaim
                Number of times a shadow node has been reclaimed

          pgscan (npn)
                Amount of scanned pages (in an inactive LRU list)

          pgsteal (npn)
                Amount of reclaimed pages

          pgscan_kswapd (npn)
                Amount of scanned pages by kswapd (in an inactive LRU list)

          pgscan_direct (npn)
                Amount of scanned pages directly (in an inactive LRU list)

          pgscan_khugepaged (npn)
                Amount of scanned pages by khugepaged (in an inactive LRU list)

          pgsteal_kswapd (npn)
                Amount of reclaimed pages by kswapd

          pgsteal_direct (npn)
                Amount of reclaimed pages directly

          pgsteal_khugepaged (npn)
                Amount of reclaimed pages by khugepaged

          pgfault (npn)
                Total number of page faults incurred

          pgmajfault (npn)
                Number of major page faults incurred

          pgrefill (npn)
                Amount of scanned pages (in an active LRU list)

          pgactivate (npn)
                Amount of pages moved to the active LRU list

          pgdeactivate (npn)
                Amount of pages moved to the inactive LRU list

          pglazyfree (npn)
                Amount of pages postponed to be freed under memory pressure

          pglazyfreed (npn)
                Amount of reclaimed lazyfree pages

          thp_fault_alloc (npn)
                Number of transparent hugepages which were allocated to satisfy
                a page fault. This counter is not present when
                CONFIG_TRANSPARENT_HUGEPAGE is not set.

          thp_collapse_alloc (npn)
                Number of transparent hugepages which were allocated to allow
                collapsing an existing range of pages. This counter is not
                present when CONFIG_TRANSPARENT_HUGEPAGE is not set.

  memory.numa_stat
        A read-only nested-keyed file which exists on non-root cgroups.

        This breaks down the cgroup's memory footprint into different
        types of memory, type-specific details, and other information
        per node on the state of the memory management system.

        This is useful for providing visibility into the NUMA locality
        information within a memcg since the pages are allowed to be
        allocated from any physical node. One use case is evaluating
        application performance by combining this information with the
        application's CPU allocation.

        All memory amounts are in bytes.

        The output format of memory.numa_stat is::

          type N0=<bytes in node 0> N1=<bytes in node 1> ...

        The entries are ordered to be human readable, and new entries
        can show up in the middle. Don't rely on items remaining in a
        fixed position; use the keys to look up specific values!

        For the meaning of the entries, refer to memory.stat.

  memory.swap.current
        A read-only single value file which exists on non-root
        cgroups.

        The total amount of swap currently being used by the cgroup
        and its descendants.

  memory.swap.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Swap usage throttle limit.  If a cgroup's swap usage exceeds
        this limit, all its further allocations will be throttled to
        allow userspace to implement custom out-of-memory procedures.

        This limit marks a point of no return for the cgroup. It is NOT
        designed to manage the amount of swapping a workload does
        during regular operation.
        Compare to memory.swap.max, which
        prohibits swapping past a set amount, but lets the cgroup
        continue unimpeded as long as other memory can be reclaimed.

        Healthy workloads are not expected to reach this limit.

  memory.swap.peak
        A read-only single value file which exists on non-root
        cgroups.

        The max swap usage recorded for the cgroup and its
        descendants since the creation of the cgroup.

  memory.swap.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Swap usage hard limit.  If a cgroup's swap usage reaches this
        limit, anonymous memory of the cgroup will not be swapped out.

  memory.swap.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          high
                The number of times the cgroup's swap usage was over
                the high threshold.

          max
                The number of times the cgroup's swap usage was about
                to go over the max boundary and swap allocation
                failed.

          fail
                The number of times swap allocation failed either
                because of running out of swap system-wide or because
                of the max limit.

        When reduced under the current usage, the existing swap
        entries are reclaimed gradually and the swap usage may stay
        higher than the limit for an extended period of time.  This
        reduces the impact on the workload and memory management.

  memory.zswap.current
        A read-only single value file which exists on non-root
        cgroups.

        The total amount of memory consumed by the zswap compression
        backend.

  memory.zswap.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Zswap usage hard limit. If a cgroup's zswap pool reaches this
        limit, it will refuse to take any more stores before existing
        entries fault back in or are written out to disk.

  memory.pressure
        A read-only nested-keyed file.

        Shows pressure stall information for memory. See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.
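
The advice given for memory.stat and memory.numa_stat, to look values up
by key rather than by position, can be followed with a small parser.  The
following is a minimal sketch in Python, assuming the file contents have
already been read into a string.  The function names ``parse_flat_keyed``
and ``parse_nested_keyed`` are hypothetical helpers, not part of any
kernel or library interface, and the sample values are made up.

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed file such as memory.stat or memory.events.

    Each non-empty line is expected to have the form "KEY VALUE".
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        result[key] = int(value)
    return result


def parse_nested_keyed(text):
    """Parse a nested-keyed file with integer values, e.g. memory.numa_stat.

    Each line is expected to have the form "KEY SUB0=VAL0 SUB1=VAL1 ...".
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        key, pairs = parts[0], parts[1:]
        result[key] = {}
        for pair in pairs:
            sub, _, val = pair.partition("=")
            result[key][sub] = int(val)
    return result


# Made-up sample contents, for illustration only.
sample_stat = "anon 1048576\nfile 4194304\nkernel 262144\n"
sample_numa = "anon N0=524288 N1=524288\nfile N0=4194304 N1=0\n"

stat = parse_flat_keyed(sample_stat)
numa = parse_nested_keyed(sample_numa)
print(stat["file"])        # looked up by key, not by line position
print(numa["anon"]["N1"])  # per-node value for node 1
```

Because the results are dictionaries, new entries showing up in the
middle of a file do not affect existing lookups.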


Usage Guidelines
~~~~~~~~~~~~~~~~

"memory.high" is the main mechanism to control memory usage.
Over-committing on high limit (sum of high limits > available memory)
and letting global memory pressure distribute memory according to
usage is a viable strategy.

Because breach of the high limit doesn't trigger the OOM killer but
throttles the offending cgroup, a management agent has ample
opportunities to monitor and take appropriate actions such as granting
more memory or terminating the workload.

Determining whether a cgroup has enough memory is not trivial as
memory usage doesn't indicate whether the workload can benefit from
more memory. For example, a workload which writes data received from
network to a file can use all available memory but can also perform
just as well with a small amount of memory. A measure of memory
pressure - how much the workload is being impacted due to lack of
memory - is necessary to determine whether a workload needs more
memory; unfortunately, a memory pressure monitoring mechanism isn't
implemented yet.
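
The over-commit condition mentioned above (sum of high limits > available
memory) can be checked with simple arithmetic. The following Python sketch
assumes that the "memory.high" contents of the sibling cgroups have already
been read as strings and that "max" denotes no limit, as it does for the
other limit files in this document. The helper names and the sizes are made
up for the example.

```python
GIB = 1024 ** 3


def parse_limit(value):
    """Convert a limit file's content to bytes; "max" means no limit."""
    value = value.strip()
    return None if value == "max" else int(value)


def high_overcommit_ratio(high_limits, available_bytes):
    """Return sum(high limits) / available memory.

    Returns None when any cgroup is unlimited, since the sum is then
    unbounded and a ratio is not meaningful.
    """
    parsed = [parse_limit(v) for v in high_limits]
    if any(p is None for p in parsed):
        return None
    return sum(parsed) / available_bytes


# Three hypothetical cgroups with a 4 GiB high limit each on a machine
# with 8 GiB of available memory: the high limits are over-committed.
limits = [str(4 * GIB), str(4 * GIB), str(4 * GIB)]
ratio = high_overcommit_ratio(limits, 8 * GIB)
print(ratio, ratio > 1.0)
```

A ratio above 1.0 means the high limits are over-committed, which the text
above describes as a viable strategy rather than a misconfiguration.
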


Memory Ownership
~~~~~~~~~~~~~~~~

A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released. Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.

A memory area may be used by processes belonging to different cgroups.
To which cgroup the area will be charged is indeterminate; however,
over time, the memory area is likely to end up in a cgroup which has
enough memory allowance to avoid high reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected
to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


IO
--

The "io" controller regulates the distribution of IO resources. This
controller implements both weight based and absolute bandwidth or IOPS
limit distribution; however, weight based distribution is available
only if cfq-iosched is in use and neither scheme is available for
blk-mq devices.


IO Interface Files
~~~~~~~~~~~~~~~~~~

  io.stat
        A read-only nested-keyed file.

        Lines are keyed by $MAJ:$MIN device numbers and not ordered.
        The following nested keys are defined.

          ======        =====================
          rbytes        Bytes read
          wbytes        Bytes written
          rios          Number of read IOs
          wios          Number of write IOs
          dbytes        Bytes discarded
          dios          Number of discard IOs
          ======        =====================

        An example read output follows::

          8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
          8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

  io.cost.qos
        A read-write nested-keyed file which exists only on the root
        cgroup.

        This file configures the Quality of Service of the IO cost
        model based controller (CONFIG_BLK_CGROUP_IOCOST) which
        currently implements "io.weight" proportional control. Lines
        are keyed by $MAJ:$MIN device numbers and not ordered. The
        line for a given device is populated on the first write for
        the device on "io.cost.qos" or "io.cost.model".
        The following
        nested keys are defined.

          ======        =====================================
          enable        Weight-based control enable
          ctrl          "auto" or "user"
          rpct          Read latency percentile    [0, 100]
          rlat          Read latency threshold
          wpct          Write latency percentile   [0, 100]
          wlat          Write latency threshold
          min           Minimum scaling percentage [1, 10000]
          max           Maximum scaling percentage [1, 10000]
          ======        =====================================

        The controller is disabled by default and can be enabled by
        setting "enable" to 1. "rpct" and "wpct" parameters default
        to zero and the controller uses internal device saturation
        state to adjust the overall IO rate between "min" and "max".

        When a better control quality is needed, latency QoS
        parameters can be configured. For example::

          8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

        shows that on sdb, the controller is enabled, will consider
        the device saturated if the 95th percentile of read completion
        latencies is above 75ms or write 150ms, and adjust the overall
        IO issue rate between 50% and 150% accordingly.

        The lower the saturation point, the better the latency QoS at
        the cost of aggregate bandwidth.
        The narrower the allowed
        adjustment range between "min" and "max", the more conformant
        to the cost model the IO behavior. Note that the IO issue
        base rate may be far off from 100% and setting "min" and "max"
        blindly can lead to a significant loss of device capacity or
        control quality. "min" and "max" are useful for regulating
        devices which show wide temporary behavior changes - e.g. an
        SSD which accepts writes at the line speed for a while and
        then completely stalls for multiple seconds.

        When "ctrl" is "auto", the parameters are controlled by the
        kernel and may change automatically. Setting "ctrl" to "user"
        or setting any of the percentile and latency parameters puts
        it into "user" mode and disables the automatic changes. The
        automatic mode can be restored by setting "ctrl" to "auto".

  io.cost.model
        A read-write nested-keyed file which exists only on the root
        cgroup.

        This file configures the cost model of the IO cost model based
        controller (CONFIG_BLK_CGROUP_IOCOST) which currently
        implements "io.weight" proportional control. Lines are keyed
        by $MAJ:$MIN device numbers and not ordered. The line for a
        given device is populated on the first write for the device on
        "io.cost.qos" or "io.cost.model". The following nested keys
        are defined.

          =====         ================================
          ctrl          "auto" or "user"
          model         The cost model in use - "linear"
          =====         ================================

        When "ctrl" is "auto", the kernel may change all parameters
        dynamically. When "ctrl" is set to "user" or any other
        parameters are written to, "ctrl" becomes "user" and the
        automatic changes are disabled.

        When "model" is "linear", the following model parameters are
        defined.

          =============         ========================================
          [r|w]bps              The maximum sequential IO throughput
          [r|w]seqiops          The maximum 4k sequential IOs per second
          [r|w]randiops         The maximum 4k random IOs per second
          =============         ========================================

        From the above, the builtin linear model determines the base
        costs of a sequential and random IO and the cost coefficient
        for the IO size. While simple, this model can cover most
        common device classes acceptably.

        The IO cost model isn't expected to be accurate in an absolute
        sense and is scaled to the device behavior dynamically.

        If needed, tools/cgroup/iocost_coef_gen.py can be used to
        generate device-specific coefficients.

  io.weight
        A read-write flat-keyed file which exists on non-root cgroups.
        The default is "default 100".

        The first line is the default weight applied to devices
        without specific override. The rest are overrides keyed by
        $MAJ:$MIN device numbers and not ordered. The weights are in
        the range [1, 10000] and specify the relative amount of IO
        time the cgroup can use in relation to its siblings.

        The default weight can be updated by writing either "default
        $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
        "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".

        An example read output follows::

          default 100
          8:16 200
          8:0 50

  io.max
        A read-write nested-keyed file which exists on non-root
        cgroups.

        BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
        device numbers and not ordered. The following nested keys are
        defined.

          =====         ==================================
          rbps          Max read bytes per second
          wbps          Max write bytes per second
          riops         Max read IO operations per second
          wiops         Max write IO operations per second
          =====         ==================================

        When writing, any number of nested key-value pairs can be
        specified in any order. "max" can be specified as the value
        to remove a specific limit. If the same key is specified
        multiple times, the outcome is undefined.

        BPS and IOPS are measured in each IO direction and IOs are
        delayed if limit is reached. Temporary bursts are allowed.

        Setting read limit at 2M BPS and write at 120 IOPS for 8:16::

          echo "8:16 rbps=2097152 wiops=120" > io.max

        Reading returns the following::

          8:16 rbps=2097152 wbps=max riops=max wiops=120

        Write IOPS limit can be removed by writing the following::

          echo "8:16 wiops=max" > io.max

        Reading now returns the following::

          8:16 rbps=2097152 wbps=max riops=max wiops=max

  io.pressure
        A read-only nested-keyed file.

        Shows pressure stall information for IO. See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.


Writeback
~~~~~~~~~

Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.

The io controller, in conjunction with the memory controller,
implements control of page cache writeback IOs. The memory controller
defines the memory domain that dirty memory ratio is calculated and
maintained for and the io controller defines the io domain which
writes out dirty pages for the memory domain. Both system-wide and
per-cgroup dirty memory states are examined and the more restrictive
of the two is enforced.

cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.

There are inherent differences in memory and writeback management
which affect how cgroup ownership is tracked.
Memory is tracked per
page while writeback is tracked per inode. For the purpose of
writeback, an inode is assigned to a cgroup and all IO requests to
write dirty pages from the inode are attributed to that cgroup.

As cgroup ownership for memory is tracked per page, there can be pages
which are associated with different cgroups than the one the inode is
associated with. These are called foreign pages. The writeback
constantly keeps track of foreign pages and, if a particular foreign
cgroup becomes the majority over a certain period of time, switches
the ownership of the inode to that cgroup.

While this model is enough for most use cases where a given inode is
mostly dirtied by a single cgroup even when the main writing cgroup
changes over time, use cases where multiple cgroups write to a single
inode simultaneously are not supported well. In such circumstances, a
significant portion of IOs are likely to be attributed incorrectly.
As the memory controller assigns page ownership on the first use and
doesn't update it until the page is released, even if writeback
strictly follows page ownership, multiple cgroups dirtying overlapping
areas wouldn't work as expected. It's recommended to avoid such usage
patterns.

The sysctl knobs which affect writeback behavior are applied to cgroup
writeback as follows.

  vm.dirty_background_ratio, vm.dirty_ratio
        These ratios apply the same to cgroup writeback with the
        amount of available memory capped by limits imposed by the
        memory controller and system-wide clean memory.

  vm.dirty_background_bytes, vm.dirty_bytes
        For cgroup writeback, this is calculated into ratio against
        total available memory and applied the same way as
        vm.dirty[_background]_ratio.


IO Latency
~~~~~~~~~~

This is a cgroup v2 controller for IO workload protection. You provide a group
with a latency target, and if the average latency exceeds that target the
controller will throttle any peers that have a lower latency target than the
protected workload.

The limits are only applied at the peer level in the hierarchy. This means that
in the diagram below, only groups A, B, and C will influence each other, and
groups D and F will influence each other. Group G will influence nobody::

            [root]
           /   |   \
          A    B    C
         / \   |
        D   F  G


So the ideal way to configure this is to set io.latency in groups A, B, and C.
Generally you do not want to set a value lower than the latency your device
supports. Experiment to find the value that works best for your workload.
Start at higher than the expected latency for your device and watch the
avg_lat value in io.stat for your workload group to get an idea of the
latency you see during normal operation. Use the avg_lat value as a basis for
your real setting, setting at 10-15% higher than the value in io.stat.

How IO Latency Throttling Works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

io.latency is work conserving; so as long as everybody is meeting their latency
target the controller doesn't do anything. Once a group starts missing its
target it begins throttling any peer group that has a higher target than itself.
This throttling takes 2 forms:

- Queue depth throttling. This is the number of outstanding IOs a group is
  allowed to have. We will clamp down relatively quickly, starting at no limit
  and going all the way down to 1 IO at a time.

- Artificial delay induction. There are certain types of IO that cannot be
  throttled without possibly adversely affecting higher priority groups. This
  includes swapping and metadata IO. These types of IO are allowed to occur
  normally, however they are "charged" to the originating group.
  If the
  originating group is being throttled you will see the use_delay and delay
  fields in io.stat increase. The delay value is how many microseconds are
  being added to any process that runs in this group. Because this number can
  grow quite large if there is a lot of swapping or metadata IO occurring we
  limit the individual delay events to 1 second at a time.

Once the victimized group starts meeting its latency target again it will start
unthrottling any peer groups that were throttled previously. If the victimized
group simply stops doing IO the global counter will unthrottle appropriately.

IO Latency Interface Files
~~~~~~~~~~~~~~~~~~~~~~~~~~

  io.latency
        This takes a similar format as the other controllers.

                "MAJOR:MINOR target=<target time in microseconds>"

  io.stat
        If the controller is enabled you will see extra stats in io.stat in
        addition to the normal ones.

          depth
                This is the current queue depth for the group.

          avg_lat
                This is an exponential moving average with a decay rate of 1/exp
                bound by the sampling interval.
                The decay rate interval can be
                calculated by multiplying the win value in io.stat by the
                corresponding number of samples based on the win value.

          win
                The sampling window size in milliseconds. This is the minimum
                duration of time between evaluation events. Windows only elapse
                with IO activity. Idle periods extend the most recent window.

IO Priority
~~~~~~~~~~~

A single attribute controls the behavior of the I/O priority cgroup policy,
namely the io.prio.class attribute. The following values are accepted for
that attribute:

  no-change
        Do not modify the I/O priority class.

  promote-to-rt
        For requests that have a non-RT I/O priority class, change it into RT.
        Also change the priority level of these requests to 4. Do not modify
        the I/O priority of requests that have priority class RT.

  restrict-to-be
        For requests that do not have an I/O priority class or that have I/O
        priority class RT, change it into BE. Also change the priority level
        of these requests to 0. Do not modify the I/O priority class of
        requests that have priority class IDLE.

  idle
        Change the I/O priority class of all requests into IDLE, the lowest
        I/O priority class.

  none-to-rt
        Deprecated. Just an alias for promote-to-rt.

The following numerical values are associated with the I/O priority policies:

+----------------+---+
| no-change      | 0 |
+----------------+---+
| restrict-to-be | 2 |
+----------------+---+
| idle           | 3 |
+----------------+---+

The numerical value that corresponds to each I/O priority class is as follows:

+-------------------------------+---+
| IOPRIO_CLASS_NONE             | 0 |
+-------------------------------+---+
| IOPRIO_CLASS_RT (real-time)   | 1 |
+-------------------------------+---+
| IOPRIO_CLASS_BE (best effort) | 2 |
+-------------------------------+---+
| IOPRIO_CLASS_IDLE             | 3 |
+-------------------------------+---+

The algorithm to set the I/O priority class for a request is as follows:

- If I/O priority class policy is promote-to-rt, change the request I/O
  priority class to IOPRIO_CLASS_RT and change the request I/O priority
  level to 4.
- If I/O priority class policy is not promote-to-rt, translate the I/O priority
  class policy into a number, then change the request I/O priority class
  into the maximum of the I/O priority class policy number and the numerical
  I/O priority class.

PID
---

The process number controller is used to allow a cgroup to stop any
new tasks from being fork()'d or clone()'d after a specified limit is
reached.

The number of tasks in a cgroup can be exhausted in ways which other
controllers cannot prevent, thus warranting its own controller. For
example, a fork bomb is likely to exhaust the number of tasks before
hitting memory restrictions.

Note that PIDs used in this controller refer to TIDs, process IDs as
used by the kernel.


PID Interface Files
~~~~~~~~~~~~~~~~~~~

  pids.max
        A read-write single value file which exists on non-root
        cgroups. The default is "max".

        Hard limit of number of processes.

  pids.current
        A read-only single value file which exists on all cgroups.

        The number of processes currently in the cgroup and its
        descendants.
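
The two files above can be combined to estimate how many more tasks a cgroup
may create. The following Python sketch assumes the file contents have
already been read as strings; the helper name is hypothetical. The result is
clamped at zero because "pids.current" can legitimately exceed "pids.max".

```python
def pids_headroom(pids_current, pids_max):
    """Return how many more tasks may be created, or None if unlimited."""
    current = int(pids_current.strip())
    limit = pids_max.strip()
    if limit == "max":
        return None  # no limit is configured
    # pids.current may be above pids.max, so never report a negative value.
    return max(int(limit) - current, 0)


print(pids_headroom("12\n", "max\n"))
print(pids_headroom("12\n", "100\n"))
print(pids_headroom("120\n", "100\n"))
```
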

Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max. This can be done by either
setting the limit to be smaller than pids.current, or attaching enough
processes to the cgroup such that pids.current is larger than
pids.max. However, it is not possible to violate a cgroup PID policy
through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.


Cpuset
------

The "cpuset" controller provides a mechanism for constraining
the CPU and memory node placement of tasks to only the resources
specified in the cpuset interface files in a task's current cgroup.
This is especially valuable on large NUMA systems where placing jobs
on properly sized subsets of the systems with careful processor and
memory placement to reduce cross-node memory access and contention
can improve overall system performance.

The "cpuset" controller is hierarchical. That means the controller
cannot use CPUs or memory nodes not allowed in its parent.
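
The hierarchical constraint can be modelled as a set intersection. The
following Python sketch is a simplified model and not the kernel's
implementation: it assumes the comma-separated number and range list format
used by the cpuset interface files, ignores CPU hotplug, and falls back to
the parent's CPUs when the request is empty or cannot be satisfied at all.
The function names are hypothetical.

```python
def parse_cpu_list(text):
    """Parse a list such as "0-4,6,8-10" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string yields an empty set
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def effective_cpus(requested, parent_effective):
    """Model the rule that a child only gets CPUs its parent has."""
    granted = parse_cpu_list(requested) & parent_effective
    # An empty request, or a request the parent cannot satisfy at all,
    # falls back to the parent's effective set in this model.
    return granted or set(parent_effective)


parent = {0, 1, 2, 3}
print(sorted(parse_cpu_list("0-4,6,8-10")))
print(sorted(effective_cpus("0-4,6,8-10", parent)))
print(sorted(effective_cpus("", parent)))
```
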


Cpuset Interface Files
~~~~~~~~~~~~~~~~~~~~~~

  cpuset.cpus
        A read-write multiple values file which exists on non-root
        cpuset-enabled cgroups.

        It lists the requested CPUs to be used by tasks within this
        cgroup.  The actual list of CPUs to be granted, however, is
        subject to constraints imposed by its parent and can differ
        from the requested CPUs.

        The CPU numbers are comma-separated numbers or ranges.
        For example::

          # cat cpuset.cpus
          0-4,6,8-10

        An empty value indicates that the cgroup is using the same
        setting as the nearest cgroup ancestor with a non-empty
        "cpuset.cpus" or all the available CPUs if none is found.

        The value of "cpuset.cpus" stays constant until the next update
        and won't be affected by any CPU hotplug events.

  cpuset.cpus.effective
        A read-only multiple values file which exists on all
        cpuset-enabled cgroups.

        It lists the onlined CPUs that are actually granted to this
        cgroup by its parent.  These CPUs are allowed to be used by
        tasks within the current cgroup.

        If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
        all the CPUs from the parent cgroup that can be available to
        be used by this cgroup.  Otherwise, it should be a subset of
        "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
        can be granted.  In this case, it will be treated just like an
        empty "cpuset.cpus".

        Its value will be affected by CPU hotplug events.

  cpuset.mems
        A read-write multiple values file which exists on non-root
        cpuset-enabled cgroups.

        It lists the requested memory nodes to be used by tasks within
        this cgroup.  The actual list of memory nodes granted, however,
        is subject to constraints imposed by its parent and can differ
        from the requested memory nodes.

        The memory node numbers are comma-separated numbers or ranges.
        For example::

          # cat cpuset.mems
          0-1,3

        An empty value indicates that the cgroup is using the same
        setting as the nearest cgroup ancestor with a non-empty
        "cpuset.mems" or all the available memory nodes if none
        is found.

        The value of "cpuset.mems" stays constant until the next update
        and won't be affected by any memory node hotplug events.

        Setting a non-empty value to "cpuset.mems" causes memory of
        tasks within the cgroup to be migrated to the designated nodes if
        they are currently using memory outside of the designated nodes.

        There is a cost for this memory migration.  The migration
        may not be complete and some memory pages may be left behind.
        So it is recommended that "cpuset.mems" be set properly
        before spawning new tasks into the cpuset.  Even if there is
        a need to change "cpuset.mems" with active tasks, it shouldn't
        be done frequently.

  cpuset.mems.effective
        A read-only multiple values file which exists on all
        cpuset-enabled cgroups.

        It lists the onlined memory nodes that are actually granted to
        this cgroup by its parent.  These memory nodes are allowed to
        be used by tasks within the current cgroup.

        If "cpuset.mems" is empty, it shows all the memory nodes from the
        parent cgroup that will be available to be used by this cgroup.
        Otherwise, it should be a subset of "cpuset.mems" unless none of
        the memory nodes listed in "cpuset.mems" can be granted.
        In this case, it will be treated just like an empty "cpuset.mems".

        Its value will be affected by memory node hotplug events.

  cpuset.cpus.partition
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups.  This flag is owned by the parent cgroup
        and is not delegatable.

        It accepts only the following input values when written to.

          ==========    =====================================
          "member"      Non-root member of a partition
          "root"        Partition root
          "isolated"    Partition root without load balancing
          ==========    =====================================

        The root cgroup is always a partition root and its state
        cannot be changed.  All other non-root cgroups start out as
        "member".

        When set to "root", the current cgroup is the root of a new
        partition or scheduling domain that comprises itself and all
        its descendants except those that are separate partition roots
        themselves and their descendants.

        When set to "isolated", the CPUs in that partition root will
        be in an isolated state without any load balancing from the
        scheduler.
        Tasks placed in such a partition with multiple
        CPUs should be carefully distributed and bound to each of the
        individual CPUs for optimal performance.

        The value shown in "cpuset.cpus.effective" of a partition root
        is the CPUs that the partition root can dedicate to a potential
        new child partition root. The new child subtracts available
        CPUs from its parent "cpuset.cpus.effective".

        A partition root ("root" or "isolated") can be in one of the
        two possible states - valid or invalid.  An invalid partition
        root is in a degraded state where some state information may
        be retained, but behaves more like a "member".

        All possible state transitions among "member", "root" and
        "isolated" are allowed.

        On read, the "cpuset.cpus.partition" file can show the following
        values.

          ============================= =====================================
          "member"                      Non-root member of a partition
          "root"                        Partition root
          "isolated"                    Partition root without load balancing
          "root invalid (<reason>)"     Invalid partition root
          "isolated invalid (<reason>)" Invalid isolated partition root
          ============================= =====================================

        In the case of an invalid partition root, a descriptive string on
        why the partition is invalid is included within parentheses.

        For a partition root to become valid, the following conditions
        must be met.

        1) The "cpuset.cpus" is exclusive with its siblings, i.e. they
           are not shared by any of its siblings (exclusivity rule).
        2) The parent cgroup is a valid partition root.
        3) The "cpuset.cpus" is not empty and must contain at least
           one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
        4) The "cpuset.cpus.effective" cannot be empty unless there is
           no task associated with this partition.

        External events like hotplug or changes to "cpuset.cpus" can
        cause a valid partition root to become invalid and vice versa.
        Note that a task cannot be moved to a cgroup with empty
        "cpuset.cpus.effective".

        For a valid partition root with the sibling cpu exclusivity
        rule enabled, changes made to "cpuset.cpus" that violate the
        exclusivity rule will invalidate the partition as well as its
        sibling partitions with conflicting cpuset.cpus values. So
        care must be taken when changing "cpuset.cpus".

        A valid non-root parent partition may distribute out all its CPUs
        to its child partitions when there is no task associated with it.

        Care must be taken when changing a valid partition root to
        "member" as all its child partitions, if present, will become
        invalid causing disruption to tasks running in those child
        partitions. These inactivated partitions could be recovered if
        their parent is switched back to a partition root with a proper
        set of "cpuset.cpus".

        Poll and inotify events are triggered whenever the state of
        "cpuset.cpus.partition" changes.  That includes changes caused
        by write to "cpuset.cpus.partition", cpu hotplug or other
        changes that modify the validity status of the partition.
        This will allow user space agents to monitor unexpected changes
        to "cpuset.cpus.partition" without the need to do continuous
        polling.
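
The list and state formats described for the cpuset interface files
can be handled by a user space agent along the following lines.  This
is a minimal Python sketch under stated assumptions, not kernel code:
the function names are hypothetical, and effective_cpus() is a
simplified model of a plain "member" cgroup that ignores partitions
and CPU hotplug.

```python
def parse_cpu_list(text):
    """Expand a list such as "0-4,6,8-10" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file yields no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def effective_cpus(requested, parent_effective):
    """Simplified model: intersect with the parent, else inherit it.

    An empty request, or a request with no grantable CPU, is treated
    like an empty "cpuset.cpus" and falls back to the parent's set.
    """
    granted = sorted(set(requested) & set(parent_effective))
    return granted if granted else sorted(parent_effective)


def parse_partition_state(text):
    """Split a cpuset.cpus.partition value into (state, invalid, reason)."""
    state, _, rest = text.strip().partition(" ")
    if rest.startswith("invalid"):
        # The reason is the text between the outermost parentheses.
        reason = rest[rest.find("(") + 1:rest.rfind(")")]
        return state, True, reason
    return state, False, None


print(parse_cpu_list("0-4,6,8-10"))
print(effective_cpus([2, 3, 12], parse_cpu_list("0-7")))
print(parse_partition_state("root invalid (<reason>)"))
```

The "<reason>" placeholder is used as-is here because the exact reason
strings reported by the kernel are not listed in this document.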


Device controller
-----------------

The device controller manages access to device files. This includes
both creation of new device files (using mknod), and access to the
existing device files.

The cgroup v2 device controller has no interface files and is implemented
on top of cgroup BPF. To control access to device files, a user may
create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach
them to cgroups with the BPF_CGROUP_DEVICE flag. On an attempt to access
a device file, corresponding BPF programs will be executed, and depending
on the return value the attempt will succeed or fail with -EPERM.

A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the
bpf_cgroup_dev_ctx structure, which describes the device access attempt:
access type (mknod/read/write) and device (type, major and minor numbers).
If the program returns 0, the attempt fails with -EPERM, otherwise it
succeeds.

An example of a BPF_PROG_TYPE_CGROUP_DEVICE program may be found in
tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.


RDMA
----

The "rdma" controller regulates the distribution and accounting of
RDMA resources.

RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~

  rdma.max
        A read-write nested-keyed file that exists for all cgroups
        except root and describes the currently configured resource
        limit for an RDMA/IB device.

        Lines are keyed by device name and are not ordered.
        Each line contains space-separated resource names and their
        configured limits that can be distributed.

        The following nested keys are defined.

          ==========    =============================
          hca_handle    Maximum number of HCA Handles
          hca_object    Maximum number of HCA Objects
          ==========    =============================

        An example for mlx4 and ocrdma device follows::

          mlx4_0 hca_handle=2 hca_object=2000
          ocrdma1 hca_handle=3 hca_object=max

  rdma.current
        A read-only file that describes current resource usage.
        It exists for all cgroups except root.

        An example for mlx4 and ocrdma device follows::

          mlx4_0 hca_handle=1 hca_object=20
          ocrdma1 hca_handle=1 hca_object=23

HugeTLB
-------

The HugeTLB controller allows limiting HugeTLB usage per control group and
enforces the controller limit during page fault.

HugeTLB Interface Files
~~~~~~~~~~~~~~~~~~~~~~~

  hugetlb.<hugepagesize>.current
        Show current usage for "hugepagesize" hugetlb.  It exists for all
        cgroups except root.

  hugetlb.<hugepagesize>.max
        Set/show the hard limit of "hugepagesize" hugetlb usage.
        The default value is "max".  It exists for all cgroups except root.

  hugetlb.<hugepagesize>.events
        A read-only flat-keyed file which exists on non-root cgroups.

          max
                The number of allocation failures due to HugeTLB limit

  hugetlb.<hugepagesize>.events.local
        Similar to hugetlb.<hugepagesize>.events but the fields in the file
        are local to the cgroup i.e. not hierarchical. The file modified event
        generated on this file reflects only the local events.

  hugetlb.<hugepagesize>.numa_stat
        Similar to memory.numa_stat, it shows the numa information of the
        hugetlb pages of <hugepagesize> in this cgroup.  Only hugetlb pages
        that are actively in use are included.  The per-node values are in
        bytes.

Misc
----

The Miscellaneous cgroup provides the resource limiting and tracking
mechanism for the scalar resources which cannot be abstracted like the other
cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC config
option.

A resource can be added to the controller via enum misc_res_type{} in the
include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[]
in the kernel/cgroup/misc.c file. The provider of the resource must set its
capacity prior to using the resource by calling misc_cg_set_capacity().

Once a capacity is set, the resource usage can be updated using the charge and
uncharge APIs. All of the APIs to interact with the misc controller are in
include/linux/misc_cgroup.h.

Misc Interface Files
~~~~~~~~~~~~~~~~~~~~

The Miscellaneous controller provides the following interface files. The
examples assume that two misc resources (res_a and res_b) are registered:

  misc.capacity
        A read-only flat-keyed file shown only in the root cgroup.
        It shows miscellaneous scalar resources available on the platform
        along with their quantities::

          $ cat misc.capacity
          res_a 50
          res_b 10

  misc.current
        A read-only flat-keyed file shown in all cgroups.  It shows
        the current usage of the resources in the cgroup and its children::

          $ cat misc.current
          res_a 3
          res_b 0

  misc.max
        A read-write flat-keyed file shown in the non-root cgroups. It shows
        the allowed maximum usage of the resources in the cgroup and its
        children::

          $ cat misc.max
          res_a max
          res_b 4

        Limit can be set by::

          # echo res_a 1 > misc.max

        Limit can be set to max by::

          # echo res_a max > misc.max

        Limits can be set higher than the capacity value in the misc.capacity
        file.

  misc.events
        A read-only flat-keyed file which exists on non-root cgroups. The
        following entries are defined. Unless specified otherwise, a value
        change in this file generates a file modified event.
        All fields in this file are hierarchical.

          max
                The number of times the cgroup's resource usage was
                about to go over the max boundary.

Migration and Ownership
~~~~~~~~~~~~~~~~~~~~~~~

A miscellaneous scalar resource is charged to the cgroup in which it is used
first, and stays charged to that cgroup until that resource is freed. Migrating
a process to a different cgroup does not move the charge to the destination
cgroup where the process has moved.

Others
------

perf_event
~~~~~~~~~~

perf_event controller, if not mounted on a legacy hierarchy, is
automatically enabled on the v2 hierarchy so that perf events can
always be filtered by cgroup v2 path.  The controller can still be
moved to a legacy hierarchy after v2 hierarchy is populated.


Non-normative information
-------------------------

This section contains information that isn't considered to be a part of
the stable kernel API and so is subject to change.


CPU controller root cgroup process behaviour
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When distributing CPU cycles in the root cgroup each thread in this
cgroup is treated as if it was hosted in a separate child cgroup of the
root cgroup. This child cgroup weight is dependent on its thread nice
level.

For details of this mapping see sched_prio_to_weight array in
kernel/sched/core.c file (values from this array should be scaled
appropriately so the neutral - nice 0 - value is 100 instead of 1024).


IO controller root cgroup process behaviour
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Root cgroup processes are hosted in an implicit leaf child node.
When distributing IO resources this implicit child node is taken into
account as if it was a normal child cgroup of the root cgroup with a
weight value of 200.


Namespace
=========

Basics
------

cgroup namespace provides a mechanism to virtualize the view of the
"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
flag can be used with clone(2) and unshare(2) to create a new cgroup
namespace.
The process running inside the cgroup namespace will have
its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
cgroupns root is the cgroup of the process at the time of creation of
the cgroup namespace.

Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
complete path of the cgroup of a process.  In a container setup where
a set of cgroups and namespaces are intended to isolate processes the
"/proc/$PID/cgroup" file may leak potential system level information
to the isolated processes.  For example::

  # cat /proc/self/cgroup
  0::/batchjobs/container_id1

The path '/batchjobs/container_id1' can be considered as system-data
and undesirable to expose to the isolated processes.  cgroup namespace
can be used to restrict visibility of this path.
For example, before
creating a cgroup namespace, one would see::

  # ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  # cat /proc/self/cgroup
  0::/batchjobs/container_id1

After unsharing a new namespace, the view changes::

  # ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # cat /proc/self/cgroup
  0::/

When some thread from a multi-threaded process unshares its cgroup
namespace, the new cgroupns gets applied to the entire process (all
the threads).  This is natural for the v2 hierarchy; however, for the
legacy hierarchies, this may be unexpected.

A cgroup namespace is alive as long as there are processes inside or
mounts pinning it.  When the last usage goes away, the cgroup
namespace is destroyed.  The cgroupns root and the actual cgroups
remain.


The Root and Views
------------------

The 'cgroupns root' for a cgroup namespace is the cgroup in which the
process calling unshare(2) is running.
For example, if a process in
/batchjobs/container_id1 cgroup calls unshare, cgroup
/batchjobs/container_id1 becomes the cgroupns root.  For the
init_cgroup_ns, this is the real root ('/') cgroup.

The cgroupns root cgroup does not change even if the namespace creator
process later moves to a different cgroup::

  # ~/unshare -c # unshare cgroupns in some cgroup
  # cat /proc/self/cgroup
  0::/
  # mkdir sub_cgrp_1
  # echo 0 > sub_cgrp_1/cgroup.procs
  # cat /proc/self/cgroup
  0::/sub_cgrp_1

Each process gets its namespace-specific view of "/proc/$PID/cgroup".

Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
From within an unshared cgroupns::

  # sleep 100000 &
  [1] 7353
  # echo 7353 > sub_cgrp_1/cgroup.procs
  # cat /proc/7353/cgroup
  0::/sub_cgrp_1

From the initial cgroup namespace, the real cgroup path will be
visible::

  $ cat /proc/7353/cgroup
  0::/batchjobs/container_id1/sub_cgrp_1

From a sibling cgroup namespace (that is, a namespace rooted at a
different cgroup), the cgroup path relative to its own cgroup
namespace root will be shown.  For instance, if PID 7353's cgroup
namespace root is at '/batchjobs/container_id2', then it will see::

  # cat /proc/7353/cgroup
  0::/../container_id2/sub_cgrp_1

Note that the relative path always starts with '/' to indicate that
it is relative to the cgroup namespace root of the caller.


Migration and setns(2)
----------------------

Processes inside a cgroup namespace can move into and out of the
namespace root if they have proper access to external cgroups.
For
example, from inside a namespace with cgroupns root at
/batchjobs/container_id1, and assuming that the global hierarchy is
still accessible inside cgroupns::

  # cat /proc/7353/cgroup
  0::/sub_cgrp_1
  # echo 7353 > batchjobs/container_id2/cgroup.procs
  # cat /proc/7353/cgroup
  0::/../container_id2

Note that this kind of setup is not encouraged.  A task inside cgroup
namespace should only be exposed to its own cgroupns hierarchy.

setns(2) to another cgroup namespace is allowed when:

(a) the process has CAP_SYS_ADMIN against its current user namespace
(b) the process has CAP_SYS_ADMIN against the target cgroup
    namespace's userns

No implicit cgroup changes happen with attaching to another cgroup
namespace.  It is expected that someone moves the attaching
process under the target cgroup namespace root.
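
The namespace-relative paths shown in the migration example earlier in
this section can be modeled by expressing a task's real cgroup path
relative to the cgroupns root of the reader.  The following Python
sketch is only a user space model of that view, not how the kernel
computes it; cgroup_view() is a hypothetical helper name.

```python
import posixpath


def cgroup_view(actual_path, ns_root):
    """Model the path part of /proc/$PID/cgroup for a cgroupns root.

    Both arguments are absolute cgroup paths in the real hierarchy.
    The result always starts with '/', which stands for ns_root.
    """
    rel = posixpath.relpath(actual_path, ns_root)
    return "/" if rel == "." else "/" + rel


root = "/batchjobs/container_id1"
print(cgroup_view("/batchjobs/container_id1/sub_cgrp_1", root))  # /sub_cgrp_1
print(cgroup_view("/batchjobs/container_id2", root))             # /../container_id2
print(cgroup_view("/batchjobs/container_id1", root))             # /
```

The second call corresponds to the task having been moved outside the
namespace root, which is why its path climbs out with "..".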


Interaction with Other Namespaces
---------------------------------

A namespace-specific cgroup hierarchy can be mounted by a process
running inside a non-init cgroup namespace::

  # mount -t cgroup2 none $MOUNT_POINT

This will mount the unified cgroup hierarchy with cgroupns root as the
filesystem root. The process needs CAP_SYS_ADMIN against its user and
mount namespaces.

The virtualization of the /proc/self/cgroup file, combined with
restricting the view of the cgroup hierarchy by a namespace-private
cgroupfs mount, provides a properly isolated cgroup view inside the
container.


Information on Kernel Programming
=================================

This section contains kernel programming information in the areas
where interacting with cgroup is necessary. cgroup core and
controllers are not covered.


Filesystem Support for Writeback
--------------------------------

A filesystem can support cgroup writeback by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.

  wbc_init_bio(@wbc, @bio)
        Should be called for each bio carrying writeback data and
        associates the bio with the inode's owner cgroup and the
        corresponding request queue. This must be called after
        a queue (device) has been associated with the bio and
        before submission.

  wbc_account_cgroup_owner(@wbc, @page, @bytes)
        Should be called for each data segment being written out.
        While this function doesn't care exactly when it's called
        during the writeback session, it is easiest and most
        natural to call it as data segments are added to a bio.

With writeback bio's annotated, cgroup support can be enabled per
super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.

wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and, if
the writeback session is holding shared resources, e.g. a journal
entry, this may lead to priority inversion. There is no one easy
solution for the problem. Filesystems can try to work around specific
problem cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.


Deprecated v1 Core Features
===========================

- Multiple hierarchies including named ones are not supported.

- None of the v1 mount options are supported.

- The "tasks" file is removed and "cgroup.procs" is not sorted.

- "cgroup.clone_children" is removed.

- /proc/cgroups is meaningless for v2. Use the "cgroup.controllers"
  file at the root instead.


Issues with v1 and Rationales for v2
====================================

Multiple Hierarchies
--------------------

cgroup v1 allowed an arbitrary number of hierarchies and each
hierarchy could host any number of controllers. While this seemed to
provide a high level of flexibility, it wasn't useful in practice.

For example, as there is only one instance of each controller, utility
type controllers such as freezer which can be useful in all
hierarchies could only be used in one. The issue is exacerbated by
the fact that controllers couldn't be moved to another hierarchy once
hierarchies were populated.
Another issue was that all controllers
bound to a hierarchy were forced to have exactly the same view of the
hierarchy. It wasn't possible to vary the granularity depending on
the specific controller.

In practice, these issues heavily limited which controllers could be
put on the same hierarchy and most configurations resorted to putting
each controller on its own hierarchy. Only closely related ones, such
as the cpu and cpuacct controllers, made sense to be put on the same
hierarchy. This often meant that userland ended up managing multiple
similar hierarchies repeating the same steps on each hierarchy
whenever a hierarchy management operation was necessary.

Furthermore, support for multiple hierarchies came at a steep cost.
It greatly complicated cgroup core implementation but more importantly
the support for multiple hierarchies restricted how cgroup could be
used in general and what controllers were able to do.

There was no limit on how many hierarchies there might be, which meant
that a thread's cgroup membership couldn't be described in finite
length.
The key might contain any number of entries and was unlimited
in length, which made it highly awkward to manipulate and led to
the addition of controllers which existed only to identify membership,
which in turn exacerbated the original problem of a proliferating
number of hierarchies.

Also, as a controller couldn't have any expectation regarding the
topologies of hierarchies other controllers might be on, each
controller had to assume that all other controllers were attached to
completely orthogonal hierarchies. This made it impossible, or at
least very cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary. What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller. In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers. For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.


Thread Granularity
------------------

cgroup v1 allowed threads of a process to belong to different cgroups.
This didn't make sense for some controllers and those controllers
ended up implementing different ways to ignore such situations but
much more importantly it blurred the line between API exposed to
individual applications and system management interface.

Generally, in-process knowledge is available only to the process
itself; thus, unlike service-level organization of processes,
categorizing threads of a process requires active participation from
the application which owns the target process.

cgroup v1 had an ambiguously defined delegation model which got abused
in combination with thread granularity. cgroups were delegated to
individual applications so that they could create and manage their own
sub-hierarchies and control resource distributions along them. This
effectively raised cgroup to the status of a syscall-like API exposed
to lay programs.

First of all, cgroup has a fundamentally inadequate interface to be
exposed this way. For a process to access its own knobs, it has to
extract the path on the target hierarchy from /proc/self/cgroup,
construct the path by appending the name of the knob to the path, open
and then read and/or write to it. This is not only extremely clunky
and unusual but also inherently racy.
There is no conventional way to
define a transaction across the required steps and nothing can
guarantee that the process would actually be operating on its own
sub-hierarchy.

cgroup controllers implemented a number of knobs which would never be
accepted as public APIs because they were just adding control knobs to
a system-management pseudo filesystem. cgroup ended up with interface
knobs which were not properly abstracted or refined and directly
revealed kernel internal details. These knobs got exposed to
individual applications through the ill-defined delegation mechanism
effectively abusing cgroup as a shortcut to implementing public APIs
without going through the required scrutiny.

This was painful for both userland and kernel. Userland ended up with
misbehaving and poorly abstracted interfaces, and the kernel ended up
inadvertently exposing, and being locked into, constructs.


Competition Between Inner Nodes and Threads
-------------------------------------------

cgroup v1 allowed threads to be in any cgroups which created an
interesting problem where threads belonging to a parent cgroup and its
children cgroups competed for resources. This was nasty as two
different types of entities competed and there was no obvious way to
settle it. Different controllers did different things.

The cpu controller considered threads and cgroups as equivalents and
mapped nice levels to cgroup weights. This worked for some cases but
fell flat when children wanted to be allocated specific ratios of CPU
cycles and the number of internal threads fluctuated - the ratios
constantly changed as the number of competing entities fluctuated.
There also were other issues. The mapping from nice level to weight
wasn't obvious or universal, and there were various other knobs which
simply weren't available for threads.

The io controller implicitly created a hidden leaf node for each
cgroup to host the threads. The hidden leaf had its own copies of all
the knobs with ``leaf_`` prefixed. While this allowed equivalent
control over internal threads, it came with serious drawbacks. It
always added an extra layer of nesting which wouldn't be necessary
otherwise, made the interface messy and significantly complicated the
implementation.

The memory controller didn't have a way to control what happened
between internal tasks and child cgroups and the behavior was not
clearly defined. There were attempts to add ad-hoc behaviors and
knobs to tailor the behavior to specific workloads which would have
led to problems extremely difficult to resolve in the long term.
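
The cpu controller problem described above can be made concrete with a
little arithmetic. The sketch below is illustrative only and rests on
stated assumptions: a single child cgroup with weight 1024 competes
with nice-0 threads placed directly in the parent, each assumed to
carry a weight of 1024, and CPU time is split in proportion to weight.
``child_cgroup_share`` and ``NICE_0_WEIGHT`` are hypothetical names.

```python
from fractions import Fraction

# Assumption for illustration: every internal thread runs at nice 0 and
# is weighted 1024, the same as the child cgroup's weight used below.
NICE_0_WEIGHT = 1024


def child_cgroup_share(child_weight: int, internal_threads: int) -> Fraction:
    """Fraction of the parent's CPU time given to one child cgroup that
    competes with `internal_threads` nice-0 threads in the parent."""
    total = child_weight + internal_threads * NICE_0_WEIGHT
    return Fraction(child_weight, total)


# The child's weight never changes, yet its share does.
for threads in (1, 3, 7):
    print(threads, child_cgroup_share(1024, threads))
```

Under these assumptions the child's share drops from 1/2 to 1/4 to 1/8
as the number of internal threads grows from one to three to seven,
although nothing about the child's own configuration changed.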

Multiple controllers struggled with internal tasks and came up with
different ways to deal with it; unfortunately, all the approaches were
severely flawed and, furthermore, the widely different behaviors
made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core
in a uniform way.


Other Interface Issues
----------------------

cgroup v1 grew without oversight and developed a large number of
idiosyncrasies and inconsistencies. One issue on the cgroup core side
was how an empty cgroup was notified - a userland helper binary was
forked and executed for each event. The event delivery wasn't
recursive or delegatable. The limitations of the mechanism also led
to an in-kernel event delivery filtering mechanism, further
complicating the interface.

Controller interfaces were problematic too. An extreme example is
controllers completely ignoring hierarchical organization and treating
all cgroups as if they were all located directly under the root
cgroup. Some controllers exposed a large amount of inconsistent
implementation details to userland.

There also was no consistency across controllers.
When a new cgroup
was created, some controllers defaulted to not imposing extra
restrictions while others disallowed any resource usage until
explicitly configured. Configuration knobs for the same type of
control used widely differing naming schemes and formats. Statistics
and information knobs were named arbitrarily and used different
formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates
controllers so that they expose minimal and consistent interfaces.


Controller Issues and Remedies
------------------------------

Memory
~~~~~~

The original lower boundary, the soft limit, is defined as a limit
that is per default unset. As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out. The costs for
optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide the
basic desirable behavior. First off, the soft limit has no
hierarchical meaning. All configured groups are organized in a global
rbtree and treated like equal peers, regardless of where they are
located in the hierarchy. This makes subtree delegation impossible.
Second,
the soft limit reclaim pass is so aggressive that it not only
introduces high allocation latencies into the system, but also impacts
system performance due to overreclaim, to the point where the feature
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
reserve. A cgroup enjoys reclaim protection when it's within its
effective low, which makes delegation of subtrees possible. It also
enjoys having reclaim pressure proportional to its overage when
above its effective low.

The original high boundary, the hard limit, is defined as a strict
limit that cannot budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of the
available memory. The memory consumption of workloads varies during
runtime, and that requires users to overcommit. But doing that with a
strict upper limit requires either a fairly accurate prediction of the
working set size or adding slack to the limit. Since working set size
estimation is hard and error prone, and getting it wrong results in
OOM kills, most users tend to err on the side of a looser limit and
end up wasting precious resources.

The memory.high boundary on the other hand can be set much more
conservatively.
When hit, it throttles allocations by forcing them
into direct reclaim to work off the excess, but it never invokes the
OOM killer. As a result, a high boundary that is chosen too
aggressively will not terminate the processes, but instead it will
lead to gradual performance degradation. The user can monitor this
and make corrections until the minimal memory footprint that still
gives acceptable performance is found.

In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary can
be exceeded. But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of the
system than killing the group. Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.

Setting the original memory.limit_in_bytes below the current usage was
subject to a race condition, where concurrent charges could cause the
limit setting to fail. memory.max on the other hand will first set the
limit to prevent new charges, and then reclaim and OOM kill until the
new limit is met - or the task writing to memory.max is killed.

The combined memory+swap accounting and limiting is replaced by real
control over swap space.

The main argument for a combined memory+swap facility in the original
cgroup design was that global or parental pressure would always be
able to swap all anonymous memory of a child group, regardless of the
child's own (possibly untrusted) configuration. However, untrusted
groups can sabotage swapping by other means - such as referencing their
anonymous memory in a tight loop - and an admin cannot assume full
swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an
intuitive userspace interface, and it flies in the face of the idea
that cgroup controllers should account and limit specific physical
resources. Swap space is a resource like all others in the system,
and that's why unified hierarchy allows distributing it separately.
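
To illustrate the difference between a combined counter and separate
ones, the following sketch splits a v1-style pair of limits (memory,
and memory+swap) into independent memory and swap limits of the kind
v2 uses. It is an assumption-laden illustration, not a prescribed
migration procedure: ``split_memsw_limit`` is a hypothetical helper,
``None`` stands for "no limit", and it assumes the combined limit is
never smaller than the memory limit.

```python
from typing import Optional, Tuple


def split_memsw_limit(
    mem_limit: int, memsw_limit: Optional[int]
) -> Tuple[int, Optional[int]]:
    """Split a combined memory+swap limit into (memory, swap) limits.

    Hypothetical helper. mem_limit is the memory-only limit in bytes;
    memsw_limit is the combined memory+swap limit in bytes, or None
    when no combined limit is set.
    """
    if memsw_limit is None:
        # No combined limit means swap is not limited separately either.
        return mem_limit, None
    if memsw_limit < mem_limit:
        raise ValueError("combined limit is smaller than the memory limit")
    # Whatever the combined limit allows beyond memory is swap.
    return mem_limit, memsw_limit - mem_limit


# 512 MiB of memory and 768 MiB of memory+swap leave 256 MiB of swap.
MIB = 1024 * 1024
print(split_memsw_limit(512 * MIB, 768 * MIB))
```

With separate limits, the swap allowance can then be changed without
touching the memory limit, which a single combined counter does not
allow.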