================
Control Group v2
================

:Date: October, 2015
:Author: Tejun Heo <tj@kernel.org>

This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.

.. CONTENTS

   1. Introduction
     1-1. Terminology
     1-2. What is cgroup?
   2. Basic Operations
     2-1. Mounting
     2-2. Organizing Processes and Threads
       2-2-1. Processes
       2-2-2. Threads
     2-3. [Un]populated Notification
     2-4. Controlling Controllers
       2-4-1. Enabling and Disabling
       2-4-2. Top-down Constraint
       2-4-3. No Internal Process Constraint
     2-5. Delegation
       2-5-1. Model of Delegation
       2-5-2. Delegation Containment
     2-6. Guidelines
       2-6-1. Organize Once and Control
       2-6-2. Avoid Name Collisions
   3. Resource Distribution Models
     3-1. Weights
     3-2. Limits
     3-3. Protections
     3-4. Allocations
   4. Interface Files
     4-1. Format
     4-2. Conventions
     4-3. Core Interface Files
   5. Controllers
     5-1. CPU
       5-1-1. CPU Interface Files
     5-2. Memory
       5-2-1. Memory Interface Files
       5-2-2. Usage Guidelines
       5-2-3. Memory Ownership
     5-3. IO
       5-3-1. IO Interface Files
       5-3-2. Writeback
       5-3-3. IO Latency
         5-3-3-1. How IO Latency Throttling Works
         5-3-3-2. IO Latency Interface Files
     5-4. PID
       5-4-1. PID Interface Files
     5-5. Cpuset
       5-5-1. Cpuset Interface Files
     5-6. Device
     5-7. RDMA
       5-7-1. RDMA Interface Files
     5-8. HugeTLB
       5-8-1. HugeTLB Interface Files
     5-9. Misc
       5-9-1. perf_event
     5-N. Non-normative information
       5-N-1. CPU controller root cgroup process behaviour
       5-N-2. IO controller root cgroup process behaviour
   6. Namespace
     6-1. Basics
     6-2. The Root and Views
     6-3. Migration and setns(2)
     6-4. Interaction with Other Namespaces
   P. Information on Kernel Programming
     P-1. Filesystem Support for Writeback
   D. Deprecated v1 Core Features
   R. Issues with v1 and Rationales for v2
     R-1. Multiple Hierarchies
     R-2. Thread Granularity
     R-3. Competition Between Inner Nodes and Threads
     R-4. Other Interface Issues
     R-5. Controller Issues and Remedies
       R-5-1. Memory


Introduction
============

Terminology
-----------

"cgroup" stands for "control group" and is never capitalized. The
singular form is used to designate the whole feature and also as a
qualifier as in "cgroup controllers". When explicitly referring to
multiple individual control groups, the plural form "cgroups" is used.


What is cgroup?
---------------

cgroup is a mechanism to organize processes hierarchically and
distribute system resources along the hierarchy in a controlled and
configurable manner.

cgroup is largely composed of two parts - the core and controllers.
cgroup core is primarily responsible for hierarchically organizing
processes. A cgroup controller is usually responsible for
distributing a specific type of system resource along the hierarchy
although there are utility controllers which serve purposes other than
resource distribution.

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup. All threads of a process belong to the
same cgroup. On creation, all processes are put in the cgroup that
the parent process belongs to at the time. A process can be migrated
to another cgroup. Migration of a process doesn't affect already
existing descendant processes.

Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup. All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups constituting the inclusive
sub-hierarchy of the cgroup. When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further. The
restrictions set closer to the root in the hierarchy cannot be
overridden from further away.


Basic Operations
================

Mounting
--------

Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2
hierarchy can be mounted with the following mount command::

  # mount -t cgroup2 none $MOUNT_POINT

cgroup2 filesystem has the magic number 0x63677270 ("cgrp").
All controllers which support v2 and are not bound to a v1 hierarchy are
automatically bound to the v2 hierarchy and show up at the root.
Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies. This allows mixing the v2 hierarchy with the
legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the v2 hierarchy after the final umount of the previous hierarchy.
Similarly, a controller should be fully disabled to be moved out of
the unified hierarchy and it may take some time for the disabled
controller to become available for other hierarchies; furthermore, due
to inter-controller dependencies, other controllers may need to be
disabled too.

While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers after system boot.
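
As a rough illustration of locating the v2 hierarchy from userland, the
following Python sketch scans text in the "/proc/mounts" layout for
entries whose filesystem type is cgroup2. The function name and the
sample text are assumptions made for this example; the mount point on a
real system depends on how the hierarchy was mounted.

```python
def find_cgroup2_mounts(mounts_text):
    """Return mount points whose filesystem type is cgroup2.

    mounts_text is expected to use the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    mount_points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines rather than raising.
        if len(fields) < 3:
            continue
        if fields[2] == "cgroup2":
            mount_points.append(fields[1])
    return mount_points


# Illustrative sample only; real systems will differ.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""

print(find_cgroup2_mounts(SAMPLE_MOUNTS))
```

On a live system the same function could be fed the contents of
"/proc/mounts"; an empty result would suggest that the v2 hierarchy has
not been mounted yet.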

During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible. To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and makes them always available in v2.

cgroup v2 currently supports the following mount options.

  nsdelegate

        Consider cgroup namespaces as delegation boundaries. This
        option is system wide and can only be set on mount or modified
        through remount from the init namespace. The mount option is
        ignored on non-init namespace mounts. Please refer to the
        Delegation section for details.

  memory_localevents

        Only populate memory.events with data for the current cgroup,
        and not any subtrees. This is legacy behaviour; the default
        behaviour without this option is to include subtree counts.
        This option is system wide and can only be set on mount or
        modified through remount from the init namespace. The mount
        option is ignored on non-init namespace mounts.

  memory_recursiveprot

        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups. This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees. This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).


Organizing Processes and Threads
--------------------------------

Processes
~~~~~~~~~

Initially, only the root cgroup exists to which all processes belong.
A child cgroup can be created by creating a sub-directory::

  # mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree
structure. Each cgroup has a read-writable interface file
"cgroup.procs". When read, it lists the PIDs of all processes which
belong to the cgroup one-per-line. The PIDs are not ordered and the
same PID may show up more than once if the process got moved to
another cgroup and then back or the PID got recycled while reading.
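
Because "cgroup.procs" is unordered and may repeat a PID, a reader
usually wants to normalize what it gets back. The following Python
sketch shows one way to do that; the helper names are hypothetical and
the cgroup directory is supplied by the caller.

```python
def parse_cgroup_procs(text):
    """Return the sorted, de-duplicated PIDs found in cgroup.procs text."""
    pids = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            pids.add(int(line))
    return sorted(pids)


def read_cgroup_procs(cgroup_dir):
    """Read and normalize "cgroup.procs" from a caller-supplied directory."""
    with open(f"{cgroup_dir}/cgroup.procs") as f:
        return parse_cgroup_procs(f.read())


# PID 842 appears twice, as can happen when a process moves away and back.
print(parse_cgroup_procs("842\n17\n842\n"))
```

Sorting is only a convenience for the caller here; the kernel makes no
ordering promise for this file.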

A process can be migrated into a cgroup by writing its PID to the
target cgroup's "cgroup.procs" file. Only one process can be migrated
on a single write(2) call. If a process is composed of multiple
threads, writing the PID of any thread migrates all threads of the
process.

When a process forks a child process, the new process is born into the
cgroup that the forking process belongs to at the time of the
operation. After exit, a process stays associated with the cgroup
that it belonged to at the time of exit until it's reaped; however, a
zombie process does not appear in "cgroup.procs" and thus can't be
moved to another cgroup.

A cgroup which doesn't have any children or live processes can be
destroyed by removing the directory. Note that a cgroup which doesn't
have any children and is associated only with zombie processes is
considered empty and can be removed::

  # rmdir $CGROUP_NAME

"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
cgroup is in use in the system, this file may contain multiple lines,
one for each hierarchy. The entry for cgroup v2 is always in the
format "0::$PATH"::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested

If the process becomes a zombie and the cgroup it was associated with
is removed subsequently, " (deleted)" is appended to the path::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested (deleted)


Threads
~~~~~~~

cgroup v2 supports thread granularity for a subset of controllers to
support use cases requiring hierarchical resource distribution across
the threads of a group of processes. By default, all threads of a
process belong to the same cgroup, which also serves as the resource
domain to host resource consumptions which are not specific to a
process or thread. The thread mode allows threads to be spread across
a subtree while still maintaining the common resource domain for them.

Controllers which support thread mode are called threaded controllers.
The ones which don't are called domain controllers.

Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup. The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy.
The root of a threaded subtree, that is, the nearest ancestor which is
not threaded, is called threaded domain or thread root interchangeably
and serves as the resource domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in
different cgroups and are not subject to the no internal process
constraint - threaded controllers can be enabled on non-leaf cgroups
whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have populated child cgroups which aren't threaded. Because the
root cgroup is not subject to the no internal process constraint, it can
serve both as a threaded domain and a parent to domain cgroups.

The current operation mode or type of the cgroup is shown in the
"cgroup.type" file which indicates whether the cgroup is a normal
domain, a domain which is serving as the domain of a threaded subtree,
or a threaded cgroup.

On creation, a cgroup is always a domain cgroup and can be made
threaded by writing "threaded" to the "cgroup.type" file.
The operation is one-way::

  # echo threaded > cgroup.type

Once threaded, the cgroup can't be made a domain again. To enable the
thread mode, the following conditions must be met.

- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.

- When the parent is an unthreaded domain, it must not have any domain
  controllers enabled or populated domain children. The root is
  exempt from this requirement.

Topology-wise, a cgroup can be in an invalid state. Please consider
the following topology::

  A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can
host child domains. C can't be used until it is turned into a
threaded cgroup. "cgroup.type" file will report "domain (invalid)" in
these cases. Operations which fail due to invalid topology use
EOPNOTSUPP as the errno.

A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.
A threaded domain reverts to a normal domain when the conditions
clear.
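
The two conditions for enabling thread mode can be read as a small
predicate on the state of the parent. The Python sketch below is a
simplified model of the rules as stated in this section, not the
kernel's implementation; the enum, its labels and the parameter names
are invented for illustration and are not the strings reported by
"cgroup.type".

```python
from enum import Enum


class CgroupType(Enum):
    # Illustrative labels for this model only.
    DOMAIN = "normal domain"
    THREADED_DOMAIN = "threaded domain"
    THREADED = "threaded"
    INVALID_DOMAIN = "invalid domain"


def can_become_threaded(parent_type, parent_is_root=False,
                        parent_has_domain_controllers=False,
                        parent_has_populated_domain_children=False):
    """Model the stated preconditions for writing "threaded" to cgroup.type."""
    # The parent must be a valid domain, a threaded domain or a threaded cgroup.
    if parent_type is CgroupType.INVALID_DOMAIN:
        return False
    # An unthreaded, non-root parent must have no domain controllers
    # enabled and no populated domain children.
    if parent_type is CgroupType.DOMAIN and not parent_is_root:
        if parent_has_domain_controllers or parent_has_populated_domain_children:
            return False
    return True


# C under the threaded cgroup B from the topology above may be made threaded.
print(can_become_threaded(CgroupType.THREADED))
# A plain domain parent with domain controllers enabled blocks the switch.
print(can_become_threaded(CgroupType.DOMAIN, parent_has_domain_controllers=True))
```

The model only looks at the immediate parent, which is all the two
listed conditions refer to.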

When read, "cgroup.threads" contains the list of the thread IDs of all
threads in the cgroup. Except that the operations are per-thread
instead of per-process, "cgroup.threads" has the same format and
behaves the same way as "cgroup.procs". While "cgroup.threads" can be
written to in any cgroup, as it can only move threads inside the same
threaded domain, its operations are confined inside each threaded
subtree.

The threaded domain cgroup serves as the resource domain for the whole
subtree, and, while the threads can be scattered across the subtree,
all the processes are considered to be in the threaded domain cgroup.
"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
processes in the subtree and is not readable in the subtree proper.
However, "cgroup.procs" can be written to from anywhere in the subtree
to migrate all threads of the matching process to the cgroup.

Only threaded controllers can be enabled in a threaded subtree. When
a threaded controller is enabled inside a threaded subtree, it only
accounts for and controls resource consumptions associated with the
threads in the cgroup and its descendants. All consumptions which
aren't tied to a specific thread belong to the threaded domain cgroup.

Because a threaded subtree is exempt from the no internal process
constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.


[Un]populated Notification
--------------------------

Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. This can be used, for
example, to start a clean-up operation after all processes of a given
sub-hierarchy have exited. The populated state updates and
notifications are recursive. Consider the following sub-hierarchy
where the numbers in the parentheses represent the numbers of processes
in each cgroup::

  A(4) - B(0) - C(1)
              \ D(0)

A, B and C's "populated" fields would be 1 while D's would be 0. After
the one process in C exits, B and C's "populated" fields would flip to
"0" and file modified events will be generated on the "cgroup.events"
files of both cgroups.
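
The recursive nature of the "populated" field can be illustrated with a
small in-memory model. The Python sketch below rebuilds the A/B/C/D
sub-hierarchy from the example and recomputes the field before and
after the process in C exits. The class and attribute names are made up
for this illustration and do not correspond to any kernel or library
interface.

```python
class Cgroup:
    """Toy model of a cgroup holding a count of live processes."""

    def __init__(self, name, nr_procs=0, parent=None):
        self.name = name
        self.nr_procs = nr_procs
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def populated(self):
        """Return 1 if this cgroup or any descendant has live processes."""
        if self.nr_procs > 0:
            return 1
        return 1 if any(child.populated() for child in self.children) else 0


# A(4) - B(0) - C(1)
#             \ D(0)
a = Cgroup("A", 4)
b = Cgroup("B", 0, parent=a)
c = Cgroup("C", 1, parent=b)
d = Cgroup("D", 0, parent=b)

before = {g.name: g.populated() for g in (a, b, c, d)}
c.nr_procs = 0  # the one process in C exits
after = {g.name: g.populated() for g in (a, b, c, d)}

# Cgroups whose value changed are the ones that would see an event.
changed = sorted(name for name in before if before[name] != after[name])
print(before, after, changed)
```

A stays populated because of its own four processes, so only B and C
change, which matches the description above.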


Controlling Controllers
-----------------------

Enabling and Disabling
~~~~~~~~~~~~~~~~~~~~~~

Each cgroup has a "cgroup.controllers" file which lists all
controllers available for the cgroup to enable::

  # cat cgroup.controllers
  cpu io memory

No controller is enabled by default. Controllers can be enabled and
disabled by writing to the "cgroup.subtree_control" file::

  # echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be
enabled. When multiple operations are specified as above, either they
all succeed or they all fail. If multiple operations on the same
controller are specified, the last one is effective.

Enabling a controller in a cgroup indicates that the distribution of
the target resource across its immediate children will be controlled.
Consider the following sub-hierarchy. The enabled controllers are
listed in parentheses::

  A(cpu,memory) - B(memory) - C()
                            \ D()

As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B.
As B has "memory" enabled but not "cpu", C and D will compete freely on
CPU cycles but their division of memory available to B will be
controlled.

As a controller regulates the distribution of the target resource to
the cgroup's children, enabling it creates the controller's interface
files in the child cgroups. In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D. Likewise, disabling "memory" from B would remove the "memory."
prefixed controller interface files from C and D. This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.


Top-down Constraint
~~~~~~~~~~~~~~~~~~~

Resources are distributed top-down and a cgroup can further distribute
a resource only if the resource has been distributed to it from the
parent. This means that all non-root "cgroup.subtree_control" files
can only contain controllers which are enabled in the parent's
"cgroup.subtree_control" file. A controller can be enabled only if
the parent has the controller enabled and a controller can't be
disabled if one or more children have it enabled.
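
The rules for writing to "cgroup.subtree_control" (last operation wins,
all-or-nothing, only available controllers can be enabled, and no
disabling while a child still has the controller enabled) can be
modelled in a few lines. The Python sketch below is a userland model of
those rules under the assumptions stated in its comments. The function
names and the use of ValueError are choices made for this example and
say nothing about the errno the kernel actually returns.

```python
def parse_subtree_control_write(request):
    """Parse a string such as "+cpu +memory -io" into {controller: enable}.

    Later operations on the same controller override earlier ones.
    """
    ops = {}
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError(f"malformed operation: {token!r}")
        ops[token[1:]] = token[0] == "+"
    return ops


def apply_subtree_control(enabled, available, children_enabled, request):
    """Return the new enabled set, or raise without changing anything.

    enabled:          controllers currently in this cgroup's subtree_control
    available:        controllers listed in this cgroup's cgroup.controllers
    children_enabled: controllers enabled in any child's subtree_control
    """
    ops = parse_subtree_control_write(request)
    result = set(enabled)  # work on a copy so a failure leaves no trace
    for name, enable in ops.items():
        if enable:
            if name not in available:
                raise ValueError(f"{name} is not available here")
            result.add(name)
        else:
            if name in children_enabled:
                raise ValueError(f"{name} is still enabled in a child")
            result.discard(name)
    return result


print(sorted(apply_subtree_control({"io"}, {"cpu", "io", "memory"}, set(),
                                   "+cpu +memory -io")))
```

For a non-root cgroup, the available set corresponds to what the parent
has enabled in its own "cgroup.subtree_control", which is how the
top-down constraint shows up in this model.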


No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Non-root cgroups can distribute domain resources to their children
only when they don't have any processes of their own. In other words,
only domain cgroups which don't contain any processes can have domain
controllers enabled in their "cgroup.subtree_control" files.

This guarantees that, when a domain controller is looking at the part
of the hierarchy which has it enabled, processes are always only on
the leaves. This rules out situations where child cgroups compete
against internal processes of the parent.

The root cgroup is exempt from this restriction. Root contains
processes and anonymous resource consumption which can't be associated
with any other cgroups and requires special treatment from most
controllers. How resource consumption in the root cgroup is governed
is up to each controller (for more information on this topic please
refer to the Non-normative information section in the Controllers
chapter).

Note that the restriction doesn't get in the way if there is no
enabled controller in the cgroup's "cgroup.subtree_control". This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup.
To control resource distribution of a cgroup, the
cgroup must create children and transfer all its processes to the
children before enabling controllers in its "cgroup.subtree_control"
file.


Delegation
----------

Model of Delegation
~~~~~~~~~~~~~~~~~~~

A cgroup can be delegated in two ways. First, to a less privileged
user by granting write access to the directory and its "cgroup.procs",
"cgroup.threads" and "cgroup.subtree_control" files to the user.
Second, if the "nsdelegate" mount option is set, automatically to a
cgroup namespace on namespace creation.

Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.

The end results are equivalent for both delegation types. Once
delegated, the user can build sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent.
The limits and other settings
of all resource controllers are hierarchical and regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may be limited explicitly in the future.


Delegation Containment
~~~~~~~~~~~~~~~~~~~~~~

A delegated sub-hierarchy is contained in the sense that processes
can't be moved into or out of the sub-hierarchy by the delegatee.

For delegations to a less privileged user, this is achieved by
requiring the following conditions for a process with a non-root euid
to migrate a target process into a cgroup by writing its PID to the
"cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file of the
  common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy it can't pull
in from or push out to outside the sub-hierarchy.
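
The two conditions above can be modelled in a few lines.  The following
is only a sketch of the rule, not the kernel's implementation: cgroups
are represented by hypothetical absolute paths, and the set
``writable_procs`` is an assumed stand-in for real permission checks on
each cgroup's "cgroup.procs" file.

```python
import posixpath


def common_ancestor(src: str, dst: str) -> str:
    """Return the deepest cgroup path that contains both src and dst."""
    return posixpath.commonpath([src, dst])


def may_migrate(writable_procs: set, src: str, dst: str) -> bool:
    """Apply the two write-access conditions to a proposed migration.

    writable_procs is a hypothetical set of cgroup paths whose
    "cgroup.procs" file the writer is allowed to write to.
    """
    # Condition 1: write access to the destination's "cgroup.procs".
    if dst not in writable_procs:
        return False
    # Condition 2: write access to "cgroup.procs" of the common ancestor.
    return common_ancestor(src, dst) in writable_procs


# Hypothetical delegation: everything at and below /deleg is writable.
delegated = {"/deleg", "/deleg/a", "/deleg/b"}
inside = may_migrate(delegated, "/deleg/a", "/deleg/b")
pull_in = may_migrate(delegated, "/other/x", "/deleg/a")
```

Here ``inside`` is allowed because the common ancestor "/deleg" is
writable, while ``pull_in`` is refused because the common ancestor is
the root, which lies above the point of delegation.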

For example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs".  U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" file and thus the write
will be denied with -EACCES.

For delegations to namespaces, containment is achieved by requiring
that both the source and destination cgroups are reachable from the
namespace of the process which is attempting the migration.  If either
is not reachable, the migration is rejected with -ENOENT.


Guidelines
----------

Organize Once and Control
~~~~~~~~~~~~~~~~~~~~~~~~~

Migrating a process across cgroups is a relatively expensive operation
and stateful resources such as memory are not moved together with the
process.
This is an explicit design decision as there often exist
inherent trade-offs between migration and various hot paths in terms
of synchronization cost.

As such, migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged.  A workload
should be assigned to a cgroup according to the system's logical and
resource structure once on start-up.  Dynamic adjustments to resource
distribution can be made by changing controller configuration through
the interface files.


Avoid Name Collisions
~~~~~~~~~~~~~~~~~~~~~

Interface files for a cgroup and its children cgroups occupy the same
directory and it is possible to create children cgroups which collide
with interface files.

All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot.  A controller's name is composed of lower case letters and
'_'s but never begins with an '_' so it can be used as the prefix
character for collision avoidance.  Also, interface file names won't
start or end with terms which are often used in categorizing workloads
such as job, service, slice, unit or workload.
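
The naming rules above suggest a simple userland check before creating
a child cgroup.  The helper below is a hypothetical heuristic built
only from those rules; ``may_collide`` and ``SAFE_TERMS`` are names
introduced for illustration and are not part of any cgroup interface.

```python
import re

# An interface file prefix: lower case letters and '_', never starting
# with '_', followed by a dot (e.g. "cgroup." or a controller name).
_INTERFACE_PREFIX = re.compile(r"^[a-z][a-z_]*\.")

# Terms which interface file names won't start or end with.
SAFE_TERMS = ("job", "service", "slice", "unit", "workload")


def may_collide(name: str) -> bool:
    """Return True if a child cgroup name could clash with an interface file."""
    if name.startswith(SAFE_TERMS) or name.endswith(SAFE_TERMS):
        return False
    return bool(_INTERFACE_PREFIX.match(name))
```

With this heuristic, names such as "_memory.max" or "system.slice" are
considered safe, while "memory.max" is flagged.  The check is
deliberately conservative: it flags any name that has the shape of an
interface file, whether or not such a file currently exists.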

cgroup doesn't do anything to prevent name collisions and it's the
user's responsibility to avoid them.


Resource Distribution Models
============================

cgroup controllers implement several resource distribution schemes
depending on the resource type and expected use cases.  This section
describes major schemes in use along with their expected behaviors.


Weights
-------

A parent's resource is distributed by adding up the weights of all
active children and giving each the fraction matching the ratio of its
weight against the sum.  As only children which can make use of the
resource at the moment participate in the distribution, this is
work-conserving.  Due to the dynamic nature, this model is usually
used for stateless resources.

All weights are in the range [1, 10000] with the default at 100.  This
allows symmetric multiplicative biases in both directions at fine
enough granularity while staying in the intuitive range.

As long as the weight is in range, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.
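
The arithmetic of the weight model can be illustrated with a short
sketch.  The function name ``weight_shares`` and the child names are
assumptions made for this example; the code only mirrors the
description above and is not how any particular controller computes
shares internally.

```python
from fractions import Fraction

WEIGHT_MIN, WEIGHT_MAX = 1, 10000


def weight_shares(weights: dict, active: list) -> dict:
    """Return each active child's fraction of the parent's resource."""
    for name, weight in weights.items():
        if not WEIGHT_MIN <= weight <= WEIGHT_MAX:
            raise ValueError(f"weight of {name!r} out of range: {weight}")
    # Only children which can use the resource right now take part.
    total = sum(weights[name] for name in active)
    return {name: Fraction(weights[name], total) for name in active}


weights = {"a": 100, "b": 200, "c": 100}
all_active = weight_shares(weights, ["a", "b", "c"])
c_idle = weight_shares(weights, ["a", "b"])
```

With all three children active, "b" receives half of the resource and
"a" and "c" a quarter each.  When "c" goes idle its share is
redistributed, so "a" receives one third and "b" two thirds, which is
what makes the model work-conserving.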

"cpu.weight" proportionally distributes CPU cycles to active children
and is an example of this type.


Limits
------

A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.

Limits are in the range [0, max] and default to "max", which is noop.

As limits can be over-committed, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.


Protections
-----------

A cgroup is protected up to the configured amount of the resource
as long as the usages of all its ancestors are under their
protected levels.  Protections can be hard guarantees or best effort
soft boundaries.  Protections can also be over-committed in which case
only up to the amount available to the parent is protected among
children.

Protections are in the range [0, max] and default to 0, which is
noop.

As protections can be over-committed, all configuration combinations
are valid and there is no reason to reject configuration changes or
process migrations.

"memory.low" implements best-effort memory protection and is an
example of this type.


Allocations
-----------

A cgroup is exclusively allocated a certain amount of a finite
resource.  Allocations can't be over-committed - the sum of the
allocations of children can not exceed the amount of resource
available to the parent.

Allocations are in the range [0, max] and default to 0, which is no
resource.

As allocations can't be over-committed, some configuration
combinations are invalid and should be rejected.  Also, if the
resource is mandatory for execution of processes, process migrations
may be rejected.

"cpu.rt.max" hard-allocates realtime slices and is an example of this
type.
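
The difference in over-commit rules between the models can be
summarized as a validity check.  This is a simplified sketch: amounts
are plain non-negative integers (the "max" token is not modelled), and
``config_is_valid`` and the model names are assumptions introduced for
illustration.

```python
def config_is_valid(model: str, parent_amount: int, child_amounts: list) -> bool:
    """Tell whether a set of child settings is acceptable under a model."""
    if any(amount < 0 for amount in child_amounts):
        return False
    if model in ("limit", "protection"):
        # Over-commit is allowed, so every in-range combination is valid.
        return True
    if model == "allocation":
        # Allocations are exclusive and can't exceed what the parent has.
        return sum(child_amounts) <= parent_amount
    raise ValueError(f"unknown model: {model!r}")


overcommitted = [60, 60]
limits_ok = config_is_valid("limit", 100, overcommitted)
allocations_ok = config_is_valid("allocation", 100, overcommitted)
```

The same pair of child settings is accepted as limits or protections
but rejected as allocations, because 60 + 60 exceeds the 100 units
available to the parent.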


Interface Files
===============

Format
------

All interface files should be in one of the following formats whenever
possible::

  New-line separated values
  (when only one value can be written at once)

	VAL0\n
	VAL1\n
	...

  Space separated values
  (when read-only or multiple values can be written at once)

	VAL0 VAL1 ...\n

  Flat keyed

	KEY0 VAL0\n
	KEY1 VAL1\n
	...

  Nested keyed

	KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
	KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
	...

For a writable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time.  For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.
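
As an illustration of the two keyed formats, the following sketch
parses file contents that have already been read into a string.  The
function names are hypothetical and values are kept as strings, since
their interpretation (integers, "max" and so on) depends on the
individual file.

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse "KEY VAL" lines into a dictionary of strings."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(None, 1)
        result[key] = value.strip()
    return result


def parse_nested_keyed(text: str) -> dict:
    """Parse "KEY SUB_KEY=VAL ..." lines into nested dictionaries."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, pairs = fields[0], fields[1:]
        # Sub key pairs may come in any order and some may be missing.
        result[key] = dict(pair.split("=", 1) for pair in pairs)
    return result


flat = parse_flat_keyed("KEY0 VAL0\nKEY1 VAL1\n")
nested = parse_nested_keyed("KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01\nKEY1 SUB_KEY1=VAL11\n")
```

Because nested keyed lines are turned into dictionaries, a reader
written this way does not depend on the order of the sub keys or on
every sub key being present.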


Conventions
-----------

- Settings for a single feature should be contained in a single file.

- The root cgroup should be exempt from resource control and thus
  shouldn't have resource control interface files.

- The default time unit is microseconds.  If a different unit is ever
  used, an explicit unit suffix must be present.

- A parts-per quantity should use a percentage decimal with at least
  two digit fractional part - e.g. 13.40.

- If a controller implements weight based resource distribution, its
  interface file should be named "weight" and have the range [1,
  10000] with 100 as the default.  The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the interface files should be named "min" and "max"
  respectively.  If a controller implements best effort resource
  guarantee and/or limit, the interface files should be named "low"
  and "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has a configurable default value and keyed specific
  overrides, the default entry should be keyed with "default" and
  appear as the first entry in the file.

  The default value can be updated by writing either "default $VAL" or
  "$VAL".

  When writing to update a specific override, "default" can be used as
  the value to indicate removal of the override.  Override entries
  with "default" as the value must not appear when read.

  For example, a setting which is keyed by major:minor device numbers
  with integer values may look like the following::

    # cat cgroup-example-interface-file
    default 150
    8:0 300

  The default value can be updated by::

    # echo 125 > cgroup-example-interface-file

  or::

    # echo "default 125" > cgroup-example-interface-file

  An override can be set by::

    # echo "8:16 170" > cgroup-example-interface-file

  and cleared by::

    # echo "8:0 default" > cgroup-example-interface-file
    # cat cgroup-example-interface-file
    default 125
    8:16 170

- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
  Whenever a notifiable event happens, a file modified event should be
  generated on the file.


Core Interface Files
--------------------

All cgroup core files are prefixed with "cgroup."

  cgroup.type

        A read-write single value file which exists on non-root
        cgroups.

        When read, it indicates the current type of the cgroup, which
        can be one of the following values.

        - "domain" : A normal valid domain cgroup.

        - "domain threaded" : A threaded domain cgroup which is
          serving as the root of a threaded subtree.

        - "domain invalid" : A cgroup which is in an invalid state.
          It can't be populated or have controllers enabled.  It may
          be allowed to become a threaded cgroup.

        - "threaded" : A threaded cgroup which is a member of a
          threaded subtree.

        A cgroup can be turned into a threaded cgroup by writing
        "threaded" to this file.

  cgroup.procs
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the PIDs of all processes which belong to
        the cgroup one-per-line.  The PIDs are not ordered and the
        same PID may show up more than once if the process got moved
        to another cgroup and then back or the PID got recycled while
        reading.

        A PID can be written to migrate the process associated with
        the PID to the cgroup.  The writer should match all of the
        following conditions.

        - It must have write access to the "cgroup.procs" file.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

        In a threaded cgroup, reading this file fails with EOPNOTSUPP
        as all the processes belong to the thread root.  Writing is
        supported and moves every thread of the process to the cgroup.

  cgroup.threads
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the TIDs of all threads which belong to
        the cgroup one-per-line.  The TIDs are not ordered and the
        same TID may show up more than once if the thread got moved to
        another cgroup and then back or the TID got recycled while
        reading.

        A TID can be written to migrate the thread associated with the
        TID to the cgroup.  The writer should match all of the
        following conditions.

        - It must have write access to the "cgroup.threads" file.

        - The cgroup that the thread is currently in must be in the
          same resource domain as the destination cgroup.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

  cgroup.controllers
        A read-only space separated values file which exists on all
        cgroups.

        It shows a space separated list of all controllers available to
        the cgroup.  The controllers are not ordered.

  cgroup.subtree_control
        A read-write space separated values file which exists on all
        cgroups.  Starts out empty.

        When read, it shows a space separated list of the controllers
        which are enabled to control resource distribution from the
        cgroup to its children.

        A space separated list of controllers prefixed with '+' or '-'
        can be written to enable or disable controllers.  A controller
        name prefixed with '+' enables the controller and '-'
        disables.  If a controller appears more than once on the list,
        the last one is effective.  When multiple enable and disable
        operations are specified, either all succeed or all fail.

  cgroup.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          populated
                1 if the cgroup or its descendants contain any live
                processes; otherwise, 0.
          frozen
                1 if the cgroup is frozen; otherwise, 0.

  cgroup.max.descendants
        A read-write single value file.  The default is "max".

        Maximum allowed number of descendant cgroups.
        If the actual number of descendants is equal or larger,
        an attempt to create a new cgroup in the hierarchy will fail.

  cgroup.max.depth
        A read-write single value file.  The default is "max".

        Maximum allowed descent depth below the current cgroup.
        If the actual descent depth is equal or larger,
        an attempt to create a new child cgroup will fail.

  cgroup.stat
        A read-only flat-keyed file with the following entries:

          nr_descendants
                Total number of visible descendant cgroups.

          nr_dying_descendants
                Total number of dying descendant cgroups.  A cgroup becomes
                dying after being deleted by a user.  The cgroup will remain
                in dying state for some undefined time (which can depend
                on system load) before being completely destroyed.

                A process can't enter a dying cgroup under any circumstances,
                and a dying cgroup can't revive.

                A dying cgroup can consume system resources not exceeding
                limits, which were active at the moment of cgroup deletion.

  cgroup.freeze
        A read-write single value file which exists on non-root cgroups.
        Allowed values are "0" and "1".  The default is "0".

        Writing "1" to the file causes freezing of the cgroup and all
        descendant cgroups.
        This means that all belonging processes will
        be stopped and will not run until the cgroup is explicitly
        unfrozen.  Freezing of the cgroup may take some time; when this action
        is completed, the "frozen" value in the cgroup.events control file
        will be updated to "1" and the corresponding notification will be
        issued.

        A cgroup can be frozen either by its own settings, or by settings
        of any ancestor cgroups.  If any of the ancestor cgroups is frozen, the
        cgroup will remain frozen.

        Processes in the frozen cgroup can be killed by a fatal signal.
        They also can enter and leave a frozen cgroup: either by an explicit
        move by a user, or if freezing of the cgroup races with fork().
        If a process is moved to a frozen cgroup, it stops.  If a process is
        moved out of a frozen cgroup, it becomes running.

        Frozen status of a cgroup doesn't affect any cgroup tree operations:
        it's possible to delete a frozen (and empty) cgroup, as well as
        create new sub-cgroups.

Controllers
===========

CPU
---

The "cpu" controller regulates distribution of CPU cycles.
This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

In all the above models, cycles distribution is defined only on a temporal
base and it does not account for the frequency at which tasks are executed.
The (optional) utilization clamping support allows hinting the schedutil
cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.

WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup.  Be aware that system management software may already
have placed RT processes into nonroot cgroups during the system boot
process, and these processes may need to be moved to the root cgroup
before the cpu controller can be enabled.


CPU Interface Files
~~~~~~~~~~~~~~~~~~~

All time durations are in microseconds.

  cpu.stat
        A read-only flat-keyed file.
        This file exists whether the controller is enabled or not.

        It always reports the following three stats:

        - usage_usec
        - user_usec
        - system_usec

        and the following three when the controller is enabled:

        - nr_periods
        - nr_throttled
        - throttled_usec

  cpu.weight
        A read-write single value file which exists on non-root
        cgroups.  The default is "100".

        The weight in the range [1, 10000].

  cpu.weight.nice
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        The nice value is in the range [-20, 19].

        This interface file is an alternative interface for
        "cpu.weight" and allows reading and setting weight using the
        same values used by nice(2).  Because the range is smaller and
        granularity is coarser for the nice values, the read value is
        the closest approximation of the current weight.

  cpu.max
        A read-write two value file which exists on non-root cgroups.
        The default is "max 100000".

        The maximum bandwidth limit.
        It's in the following format::

          $MAX $PERIOD

        which indicates that the group may consume up to $MAX in each
        $PERIOD duration.  "max" for $MAX indicates no limit.  If only
        one number is written, $MAX is updated.

  cpu.pressure
        A read-only nested-key file which exists on non-root cgroups.

        Shows pressure stall information for CPU.  See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.

  cpu.uclamp.min
        A read-write single value file which exists on non-root cgroups.
        The default is "0", i.e. no utilization boosting.

        The requested minimum utilization (protection) as a percentage
        rational number, e.g. 12.34 for 12.34%.

        This interface allows reading and setting minimum utilization clamp
        values similar to sched_setattr(2).  This minimum utilization
        value is used to clamp the task specific minimum utilization clamp.

        The requested minimum utilization (protection) is always capped by
        the current value for the maximum utilization (limit), i.e.
        `cpu.uclamp.max`.

  cpu.uclamp.max
        A read-write single value file which exists on non-root cgroups.
        The default is "max", i.e.
        no utilization capping.

        The requested maximum utilization (limit) as a percentage rational
        number, e.g. 98.76 for 98.76%.

        This interface allows reading and setting maximum utilization clamp
        values similar to sched_setattr(2). This maximum utilization
        value is used to clamp the task specific maximum utilization clamp.



Memory
------

The "memory" controller regulates distribution of memory.  Memory is
stateful and implements both limit and protection models.  Due to the
intertwining between memory usage and reclaim pressure and the
stateful nature of memory, the distribution model is relatively
complex.

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be
accounted and controlled to a reasonable extent.  Currently, the
following types of memory usages are tracked.

- Userland memory - page cache and anonymous memory.

- Kernel data structures such as dentries and inodes.

- TCP socket buffers.

The above list may expand in the future for better coverage.

10868c2ecf20Sopenharmony_ci 10878c2ecf20Sopenharmony_ci 10888c2ecf20Sopenharmony_ciMemory Interface Files 10898c2ecf20Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~ 10908c2ecf20Sopenharmony_ci 10918c2ecf20Sopenharmony_ciAll memory amounts are in bytes. If a value which is not aligned to 10928c2ecf20Sopenharmony_ciPAGE_SIZE is written, the value may be rounded up to the closest 10938c2ecf20Sopenharmony_ciPAGE_SIZE multiple when read back. 10948c2ecf20Sopenharmony_ci 10958c2ecf20Sopenharmony_ci memory.current 10968c2ecf20Sopenharmony_ci A read-only single value file which exists on non-root 10978c2ecf20Sopenharmony_ci cgroups. 10988c2ecf20Sopenharmony_ci 10998c2ecf20Sopenharmony_ci The total amount of memory currently being used by the cgroup 11008c2ecf20Sopenharmony_ci and its descendants. 11018c2ecf20Sopenharmony_ci 11028c2ecf20Sopenharmony_ci memory.min 11038c2ecf20Sopenharmony_ci A read-write single value file which exists on non-root 11048c2ecf20Sopenharmony_ci cgroups. The default is "0". 11058c2ecf20Sopenharmony_ci 11068c2ecf20Sopenharmony_ci Hard memory protection. If the memory usage of a cgroup 11078c2ecf20Sopenharmony_ci is within its effective min boundary, the cgroup's memory 11088c2ecf20Sopenharmony_ci won't be reclaimed under any conditions. If there is no 11098c2ecf20Sopenharmony_ci unprotected reclaimable memory available, OOM killer 11108c2ecf20Sopenharmony_ci is invoked. Above the effective min boundary (or 11118c2ecf20Sopenharmony_ci effective low boundary if it is higher), pages are reclaimed 11128c2ecf20Sopenharmony_ci proportionally to the overage, reducing reclaim pressure for 11138c2ecf20Sopenharmony_ci smaller overages. 11148c2ecf20Sopenharmony_ci 11158c2ecf20Sopenharmony_ci Effective min boundary is limited by memory.min values of 11168c2ecf20Sopenharmony_ci all ancestor cgroups. 
        If there is memory.min overcommitment
        (child cgroup or cgroups are requiring more protected memory
        than the parent will allow), then each child cgroup will get
        the part of the parent's protection proportional to its
        actual memory usage below memory.min.

        Putting more memory than generally available under this
        protection is discouraged and may lead to constant OOMs.

        If a memory cgroup is not populated with processes,
        its memory.min is ignored.

  memory.low
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        Best-effort memory protection.  If the memory usage of a
        cgroup is within its effective low boundary, the cgroup's
        memory won't be reclaimed unless there is no reclaimable
        memory available in unprotected cgroups.
        Above the effective low boundary (or
        effective min boundary if it is higher), pages are reclaimed
        proportionally to the overage, reducing reclaim pressure for
        smaller overages.

        Effective low boundary is limited by memory.low values of
        all ancestor cgroups.
        If there is memory.low overcommitment
        (child cgroup or cgroups are requiring more protected memory
        than the parent will allow), then each child cgroup will get
        the part of the parent's protection proportional to its
        actual memory usage below memory.low.

        Putting more memory than generally available under this
        protection is discouraged.

  memory.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage throttle limit.  This is the main mechanism to
        control memory usage of a cgroup.  If a cgroup's usage goes
        over the high boundary, the processes of the cgroup are
        throttled and put under heavy reclaim pressure.

        Going over the high limit never invokes the OOM killer and
        under extreme conditions the limit may be breached.

  memory.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage hard limit.  This is the final protection
        mechanism.  If a cgroup's memory usage reaches this limit and
        can't be reduced, the OOM killer is invoked in the cgroup.
        Under certain circumstances, the usage may go over the limit
        temporarily.

        In the default configuration, regular 0-order allocations always
        succeed unless the OOM killer chooses the current task as a victim.

        Some kinds of allocations don't invoke the OOM killer.
        The caller could retry them differently, return -ENOMEM to
        userspace, or silently ignore the failure in cases like disk
        readahead.

        This is the ultimate protection mechanism.  As long as the
        high limit is used and monitored properly, this limit's
        utility is limited to providing the final safety net.

  memory.oom.group
        A read-write single value file which exists on non-root
        cgroups.  The default value is "0".

        Determines whether the cgroup should be treated as
        an indivisible workload by the OOM killer. If set,
        all tasks belonging to the cgroup or to its descendants
        (if the memory cgroup is not a leaf cgroup) are killed
        together or not at all. This can be used to avoid
        partial kills to guarantee workload integrity.

        Tasks with the OOM protection (oom_score_adj set to -1000)
        are treated as an exception and are never killed.

        If the OOM killer is invoked in a cgroup, it's not going
        to kill any tasks outside of this cgroup, regardless of
        memory.oom.group values of ancestor cgroups.
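
As a rough illustration of the memory.min and memory.low overcommitment
rule described above, the following Python sketch splits a parent's
protection among its children in proportion to each child's usage below
its own setting.  The function name, the argument layout and the integer
rounding are assumptions made for this example; it is a simplified model
of the documented behaviour, not the kernel's actual calculation.

```python
def split_protection(parent_protection, children):
    """Distribute a parent's effective protection among its children.

    children maps a (hypothetical) child name to a tuple
    (configured_protection, current_usage), both in bytes.
    Returns a dict mapping each child name to its effective protection.
    """
    # Only usage below the child's own setting counts as protected.
    protected = {
        name: min(setting, usage)
        for name, (setting, usage) in children.items()
    }
    total = sum(protected.values())

    # No overcommitment: every child keeps its protected usage.
    if total <= parent_protection:
        return protected

    # Overcommitment: each child gets a share of the parent's
    # protection proportional to its protected usage.
    return {
        name: parent_protection * amount // total
        for name, amount in protected.items()
    }


# Parent protects 100 units; both children ask for 80.
shares = split_protection(100, {"a": (80, 60), "b": (80, 90)})
print(shares)
```

Here the children together want 60 + 80 = 140 units protected, more than
the parent allows, so the parent's 100 units are split 60:80 between
them.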

  memory.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

        Note that all fields in this file are hierarchical and the
        file modified event can be generated due to an event down the
        hierarchy. For the local events at the cgroup level see
        memory.events.local.

          low
                The number of times the cgroup is reclaimed due to
                high memory pressure even though its usage is under
                the low boundary.  This usually indicates that the low
                boundary is over-committed.

          high
                The number of times processes of the cgroup are
                throttled and routed to perform direct memory reclaim
                because the high memory boundary was exceeded.  For a
                cgroup whose memory usage is capped by the high limit
                rather than global memory pressure, this event's
                occurrences are expected.

          max
                The number of times the cgroup's memory usage was
                about to go over the max boundary.  If direct reclaim
                fails to bring it down, the cgroup goes to OOM state.

          oom
                The number of times the cgroup's memory usage
                reached the limit and allocation was about to fail.

                This event is not raised if the OOM killer is not
                considered as an option, e.g. for failed high-order
                allocations or if the caller asked not to retry.

          oom_kill
                The number of processes belonging to this cgroup
                killed by any kind of OOM killer.

  memory.events.local
        Similar to memory.events but the fields in the file are local
        to the cgroup, i.e. not hierarchical. The file modified event
        generated on this file reflects only the local events.

  memory.stat
        A read-only flat-keyed file which exists on non-root cgroups.

        This breaks down the cgroup's memory footprint into different
        types of memory, type-specific details, and other information
        on the state and past events of the memory management system.

        All memory amounts are in bytes.

        The entries are ordered to be human readable, and new entries
        can show up in the middle. Don't rely on items remaining in a
        fixed position; use the keys to look up specific values!

        If an entry has no per-node counter (and therefore does not
        appear in memory.numa_stat), it is tagged with 'npn'
        (non-per-node) to indicate that it will not show up in
        memory.numa_stat.

          anon
                Amount of memory used in anonymous mappings such as
                brk(), sbrk(), and mmap(MAP_ANONYMOUS)

          file
                Amount of memory used to cache filesystem data,
                including tmpfs and shared memory.

          kernel_stack
                Amount of memory allocated to kernel stacks.

          percpu(npn)
                Amount of memory used for storing per-cpu kernel
                data structures.

          sock(npn)
                Amount of memory used in network transmission buffers

          shmem
                Amount of cached filesystem data that is swap-backed,
                such as tmpfs, shm segments, shared anonymous mmap()s

          file_mapped
                Amount of cached filesystem data mapped with mmap()

          file_dirty
                Amount of cached filesystem data that was modified but
                not yet written back to disk

          file_writeback
                Amount of cached filesystem data that was modified and
                is currently being written back to disk

          anon_thp
                Amount of memory used in anonymous mappings backed by
                transparent hugepages

          inactive_anon, active_anon, inactive_file, active_file, unevictable
                Amount of memory, swap-backed and filesystem-backed,
                on the internal memory management lists used by the
                page reclaim algorithm.

                As these represent internal list state (e.g. shmem pages are on anon
                memory management lists), inactive_foo + active_foo may not be equal to
                the value for the foo counter, since the foo counter is type-based, not
                list-based.

          slab_reclaimable
                Part of "slab" that might be reclaimed, such as
                dentries and inodes.

          slab_unreclaimable
                Part of "slab" that cannot be reclaimed on memory
                pressure.

          slab(npn)
                Amount of memory used for storing in-kernel data
                structures.

          workingset_refault_anon
                Number of refaults of previously evicted anonymous pages.

          workingset_refault_file
                Number of refaults of previously evicted file pages.

          workingset_activate_anon
                Number of refaulted anonymous pages that were immediately
                activated.

          workingset_activate_file
                Number of refaulted file pages that were immediately activated.

          workingset_restore_anon
                Number of restored anonymous pages which have been detected as
                an active workingset before they got reclaimed.

          workingset_restore_file
                Number of restored file pages which have been detected as an
                active workingset before they got reclaimed.

          workingset_nodereclaim
                Number of times a shadow node has been reclaimed

          pgfault(npn)
                Total number of page faults incurred

          pgmajfault(npn)
                Number of major page faults incurred

          pgrefill(npn)
                Amount of scanned pages (in an active LRU list)

          pgscan(npn)
                Amount of scanned pages (in an inactive LRU list)

          pgsteal(npn)
                Amount of reclaimed pages

          pgactivate(npn)
                Amount of pages moved to the active LRU list

          pgdeactivate(npn)
                Amount of pages moved to the inactive LRU list

          pglazyfree(npn)
                Amount of pages postponed to be freed under memory pressure

          pglazyfreed(npn)
                Amount of reclaimed lazyfree pages

          thp_fault_alloc(npn)
                Number of transparent hugepages which were allocated to satisfy
                a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
                is not set.

          thp_collapse_alloc(npn)
                Number of transparent hugepages which were allocated to allow
                collapsing an existing range of pages.
                This counter is not
                present when CONFIG_TRANSPARENT_HUGEPAGE is not set.

  memory.numa_stat
        A read-only nested-keyed file which exists on non-root cgroups.

        This breaks down the cgroup's memory footprint into different
        types of memory, type-specific details, and other information
        per node on the state of the memory management system.

        This is useful for providing visibility into the NUMA locality
        information within a memcg since the pages are allowed to be
        allocated from any physical node. One use case is evaluating
        application performance by combining this information with the
        application's CPU allocation.

        All memory amounts are in bytes.

        The output format of memory.numa_stat is::

          type N0=<bytes in node 0> N1=<bytes in node 1> ...

        The entries are ordered to be human readable, and new entries
        can show up in the middle. Don't rely on items remaining in a
        fixed position; use the keys to look up specific values!

        For the meaning of the entries, refer to memory.stat.

  memory.swap.current
        A read-only single value file which exists on non-root
        cgroups.

        The total amount of swap currently being used by the cgroup
        and its descendants.

  memory.swap.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Swap usage throttle limit.  If a cgroup's swap usage exceeds
        this limit, all its further allocations will be throttled to
        allow userspace to implement custom out-of-memory procedures.

        This limit marks a point of no return for the cgroup. It is NOT
        designed to manage the amount of swapping a workload does
        during regular operation. Compare to memory.swap.max, which
        prohibits swapping past a set amount, but lets the cgroup
        continue unimpeded as long as other memory can be reclaimed.

        Healthy workloads are not expected to reach this limit.

  memory.swap.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Swap usage hard limit.  If a cgroup's swap usage reaches this
        limit, anonymous memory of the cgroup will not be swapped out.

  memory.swap.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.
        Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          high
                The number of times the cgroup's swap usage was over
                the high threshold.

          max
                The number of times the cgroup's swap usage was about
                to go over the max boundary and swap allocation
                failed.

          fail
                The number of times swap allocation failed either
                because of running out of swap system-wide or max
                limit.

        When reduced under the current usage, the existing swap
        entries are reclaimed gradually and the swap usage may stay
        higher than the limit for an extended period of time.  This
        reduces the impact on the workload and memory management.

  memory.pressure
        A read-only nested-keyed file which exists on non-root cgroups.

        Shows pressure stall information for memory. See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.


Usage Guidelines
~~~~~~~~~~~~~~~~

"memory.high" is the main mechanism to control memory usage.
Over-committing on high limit (sum of high limits > available memory)
and letting global memory pressure distribute memory according to
usage is a viable strategy.

Because breach of the high limit doesn't trigger the OOM killer but
throttles the offending cgroup, a management agent has ample
opportunities to monitor and take appropriate actions such as granting
more memory or terminating the workload.

Determining whether a cgroup has enough memory is not trivial as
memory usage doesn't indicate whether the workload can benefit from
more memory.  For example, a workload which writes data received from
network to a file can use all available memory but can also perform
equally well with a small amount of memory.  A measure of memory
pressure - how much the workload is being impacted due to lack of
memory - is necessary to determine whether a workload needs more
memory; the pressure stall information exposed in "memory.pressure"
is intended to provide such a measure.


Memory Ownership
~~~~~~~~~~~~~~~~

A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released.  Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.

A memory area may be used by processes belonging to different cgroups.
To which cgroup the area will be charged is indeterminate; however,
over time, the memory area is likely to end up in a cgroup which has
enough memory allowance to avoid high reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected
to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


IO
--

The "io" controller regulates the distribution of IO resources.  This
controller implements both weight based and absolute bandwidth or IOPS
limit distribution; however, weight based distribution is available
only if cfq-iosched is in use and neither scheme is available for
blk-mq devices.


IO Interface Files
~~~~~~~~~~~~~~~~~~

  io.stat
        A read-only nested-keyed file.

        Lines are keyed by $MAJ:$MIN device numbers and not ordered.
        The following nested keys are defined.

          ======  =====================
          rbytes  Bytes read
          wbytes  Bytes written
          rios    Number of read IOs
          wios    Number of write IOs
          dbytes  Bytes discarded
          dios    Number of discard IOs
          ======  =====================

        An example read output follows::

          8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
          8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

  io.cost.qos
        A read-write nested-keyed file which exists only on the root
        cgroup.

        This file configures the Quality of Service of the IO cost
        model based controller (CONFIG_BLK_CGROUP_IOCOST) which
        currently implements "io.weight" proportional control.  Lines
        are keyed by $MAJ:$MIN device numbers and not ordered.  The
        line for a given device is populated on the first write for
        the device on "io.cost.qos" or "io.cost.model".  The following
        nested keys are defined.

          ======  =====================================
          enable  Weight-based control enable
          ctrl    "auto" or "user"
          rpct    Read latency percentile    [0, 100]
          rlat    Read latency threshold
          wpct    Write latency percentile   [0, 100]
          wlat    Write latency threshold
          min     Minimum scaling percentage [1, 10000]
          max     Maximum scaling percentage [1, 10000]
          ======  =====================================

        The controller is disabled by default and can be enabled by
        setting "enable" to 1.  "rpct" and "wpct" parameters default
        to zero and the controller uses internal device saturation
        state to adjust the overall IO rate between "min" and "max".

        When a better control quality is needed, latency QoS
        parameters can be configured.  For example::

          8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

        shows that on sdb, the controller is enabled, will consider
        the device saturated if the 95th percentile of read completion
        latencies is above 75ms or write 150ms, and adjust the overall
        IO issue rate between 50% and 150% accordingly.

        The lower the saturation point, the better the latency QoS at
        the cost of aggregate bandwidth.
        The narrower the allowed
        adjustment range between "min" and "max", the more conformant
        to the cost model the IO behavior. Note that the IO issue
        base rate may be far off from 100% and setting "min" and "max"
        blindly can lead to a significant loss of device capacity or
        control quality. "min" and "max" are useful for regulating
        devices which show wide temporary behavior changes - e.g. an
        SSD which accepts writes at the line speed for a while and
        then completely stalls for multiple seconds.

        When "ctrl" is "auto", the parameters are controlled by the
        kernel and may change automatically. Setting "ctrl" to "user"
        or setting any of the percentile and latency parameters puts
        it into "user" mode and disables the automatic changes. The
        automatic mode can be restored by setting "ctrl" to "auto".

  io.cost.model
        A read-write nested-keyed file which exists only on the root
        cgroup.

        This file configures the cost model of the IO cost model based
        controller (CONFIG_BLK_CGROUP_IOCOST) which currently
        implements "io.weight" proportional control. Lines are keyed
        by $MAJ:$MIN device numbers and not ordered. The line for a
        given device is populated on the first write for the device on
        "io.cost.qos" or "io.cost.model". The following nested keys
        are defined.

          =====  ================================
          ctrl   "auto" or "user"
          model  The cost model in use - "linear"
          =====  ================================

        When "ctrl" is "auto", the kernel may change all parameters
        dynamically. When "ctrl" is set to "user" or any other
        parameter is written to, "ctrl" becomes "user" and the
        automatic changes are disabled.

        When "model" is "linear", the following model parameters are
        defined.

          =============  ========================================
          [r|w]bps       The maximum sequential IO throughput
          [r|w]seqiops   The maximum 4k sequential IOs per second
          [r|w]randiops  The maximum 4k random IOs per second
          =============  ========================================

        From the above, the builtin linear model determines the base
        costs of a sequential and random IO and the cost coefficient
        for the IO size. While simple, this model can cover most
        common device classes acceptably.

        The IO cost model isn't expected to be accurate in an absolute
        sense and is scaled to the device behavior dynamically.

        If needed, tools/cgroup/iocost_coef_gen.py can be used to
        generate device-specific coefficients.
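
The nested-keyed files described so far (io.stat, io.cost.qos and
io.cost.model) share one line layout: a $MAJ:$MIN key followed by
space-separated key=value pairs. The following is a minimal userspace
sketch of reading such content, not kernel tooling. The function name
parse_nested_keyed is hypothetical, and the sample line is the
io.cost.qos example shown earlier. Values are kept as strings because a
field may hold a number or a word such as "auto".

```python
def parse_nested_keyed(text):
    """Parse nested-keyed cgroup file content into {device: {key: value}}.

    Hypothetical helper for illustration only. Values stay strings
    because a field may be numeric or a word such as "auto".
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        device, pairs = fields[0], fields[1:]
        # Each remaining field is expected to look like key=value.
        result[device] = dict(pair.split("=", 1) for pair in pairs)
    return result


sample = (
    "8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 "
    "wpct=95.00 wlat=150000 min=50.00 max=150.0\n"
)
qos = parse_nested_keyed(sample)
# rlat is given in microseconds in the example above: 75000 is 75 ms.
print(qos["8:16"]["ctrl"], int(qos["8:16"]["rlat"]) / 1000)
```

In real use the text would come from reading the interface file in the
cgroup directory. The sketch does not validate input, so a field that is
not in key=value form raises ValueError.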

  io.weight
        A read-write flat-keyed file which exists on non-root cgroups.
        The default is "default 100".

        The first line is the default weight applied to devices
        without specific override. The rest are overrides keyed by
        $MAJ:$MIN device numbers and not ordered. The weights are in
        the range [1, 10000] and specify the relative amount of IO time
        the cgroup can use in relation to its siblings.

        The default weight can be updated by writing either "default
        $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
        "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".

        An example read output follows::

          default 100
          8:16 200
          8:0 50

  io.max
        A read-write nested-keyed file which exists on non-root
        cgroups.

        BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
        device numbers and not ordered. The following nested keys are
        defined.

          =====  ==================================
          rbps   Max read bytes per second
          wbps   Max write bytes per second
          riops  Max read IO operations per second
          wiops  Max write IO operations per second
          =====  ==================================

        When writing, any number of nested key-value pairs can be
        specified in any order. "max" can be specified as the value
        to remove a specific limit. If the same key is specified
        multiple times, the outcome is undefined.

        BPS and IOPS are measured in each IO direction and IOs are
        delayed if the limit is reached. Temporary bursts are allowed.

        Setting read limit at 2M BPS and write at 120 IOPS for 8:16::

          echo "8:16 rbps=2097152 wiops=120" > io.max

        Reading returns the following::

          8:16 rbps=2097152 wbps=max riops=max wiops=120

        Write IOPS limit can be removed by writing the following::

          echo "8:16 wiops=max" > io.max

        Reading now returns the following::

          8:16 rbps=2097152 wbps=max riops=max wiops=max

  io.pressure
        A read-only nested-keyed file which exists on non-root cgroups.

        Shows pressure stall information for IO. See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.


Writeback
~~~~~~~~~

Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.

The io controller, in conjunction with the memory controller,
implements control of page cache writeback IOs. The memory controller
defines the memory domain that dirty memory ratio is calculated and
maintained for and the io controller defines the io domain which
writes out dirty pages for the memory domain. Both system-wide and
per-cgroup dirty memory states are examined and the more restrictive
of the two is enforced.

cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.

There are inherent differences in memory and writeback management
which affect how cgroup ownership is tracked.
Memory is tracked per
page while writeback is tracked per inode. For the purpose of writeback, an
inode is assigned to a cgroup and all IO requests to write dirty pages
from the inode are attributed to that cgroup.

As cgroup ownership for memory is tracked per page, there can be pages
which are associated with different cgroups than the one the inode is
associated with. These are called foreign pages. The writeback
constantly keeps track of foreign pages and, if a particular foreign
cgroup becomes the majority over a certain period of time, switches
the ownership of the inode to that cgroup.

While this model is enough for most use cases where a given inode is
mostly dirtied by a single cgroup even when the main writing cgroup
changes over time, use cases where multiple cgroups write to a single
inode simultaneously are not supported well. In such circumstances, a
significant portion of IOs are likely to be attributed incorrectly.
As the memory controller assigns page ownership on the first use and
doesn't update it until the page is released, even if writeback
strictly follows page ownership, multiple cgroups dirtying overlapping
areas wouldn't work as expected. It's recommended to avoid such usage
patterns.

The sysctl knobs which affect writeback behavior are applied to cgroup
writeback as follows.

  vm.dirty_background_ratio, vm.dirty_ratio
        These ratios apply the same to cgroup writeback with the
        amount of available memory capped by limits imposed by the
        memory controller and system-wide clean memory.

  vm.dirty_background_bytes, vm.dirty_bytes
        For cgroup writeback, this is calculated into ratio against
        total available memory and applied the same way as
        vm.dirty[_background]_ratio.


IO Latency
~~~~~~~~~~

This is a cgroup v2 controller for IO workload protection. You provide a group
with a latency target, and if the average latency exceeds that target the
controller will throttle any peers that have a higher latency target than the
protected workload.

The limits are only applied at the peer level in the hierarchy. This means that
in the diagram below, only groups A, B, and C will influence each other, and
groups D and F will influence each other. Group G will influence nobody::

                [root]
              /   |   \
             A    B    C
            / \   |
           D   F  G


So the ideal way to configure this is to set io.latency in groups A, B, and C.
Generally you do not want to set a value lower than the latency your device
supports. Experiment to find the value that works best for your workload.
Start at higher than the expected latency for your device and watch the
avg_lat value in io.stat for your workload group to get an idea of the
latency you see during normal operation. Use the avg_lat value as a basis for
your real setting, setting at 10-15% higher than the value in io.stat.

How IO Latency Throttling Works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

io.latency is work conserving; so as long as everybody is meeting their latency
target the controller doesn't do anything. Once a group starts missing its
target it begins throttling any peer group that has a higher target than itself.
This throttling takes 2 forms:

- Queue depth throttling. This is the number of outstanding IOs a group is
  allowed to have. We will clamp down relatively quickly, starting at no limit
  and going all the way down to 1 IO at a time.

- Artificial delay induction. There are certain types of IO that cannot be
  throttled without possibly adversely affecting higher priority groups. This
  includes swapping and metadata IO. These types of IO are allowed to occur
  normally; however, they are "charged" to the originating group.
  If the
  originating group is being throttled you will see the use_delay and delay
  fields in io.stat increase. The delay value is the number of microseconds
  being added to any process that runs in this group. Because this number can
  grow quite large if there is a lot of swapping or metadata IO occurring we
  limit the individual delay events to 1 second at a time.

Once the victimized group starts meeting its latency target again it will start
unthrottling any peer groups that were throttled previously. If the victimized
group simply stops doing IO the global counter will unthrottle appropriately.

IO Latency Interface Files
~~~~~~~~~~~~~~~~~~~~~~~~~~

  io.latency
        This takes a similar format as the other controllers.

                "MAJOR:MINOR target=<target time in microseconds>"

  io.stat
        If the controller is enabled you will see extra stats in io.stat in
        addition to the normal ones.

          depth
                This is the current queue depth for the group.

          avg_lat
                This is an exponential moving average with a decay rate of 1/exp
                bound by the sampling interval.
                The decay rate interval can be
                calculated by multiplying the win value in io.stat by the
                corresponding number of samples based on the win value.

          win
                The sampling window size in milliseconds. This is the minimum
                duration of time between evaluation events. Windows only elapse
                with IO activity. Idle periods extend the most recent window.

PID
---

The process number controller is used to allow a cgroup to stop any
new tasks from being fork()'d or clone()'d after a specified limit is
reached.

The number of tasks in a cgroup can be exhausted in ways which other
controllers cannot prevent, thus warranting its own controller. For
example, a fork bomb is likely to exhaust the number of tasks before
hitting memory restrictions.

Note that PIDs used in this controller refer to TIDs, process IDs as
used by the kernel.


PID Interface Files
~~~~~~~~~~~~~~~~~~~

  pids.max
        A read-write single value file which exists on non-root
        cgroups. The default is "max".

        Hard limit on the number of processes.

  pids.current
        A read-only single value file which exists on all cgroups.

        The number of processes currently in the cgroup and its
        descendants.

Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max. This can be done by either
setting the limit to be smaller than pids.current, or attaching enough
processes to the cgroup such that pids.current is larger than
pids.max. However, it is not possible to violate a cgroup PID policy
through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.


Cpuset
------

The "cpuset" controller provides a mechanism for constraining
the CPU and memory node placement of tasks to only the resources
specified in the cpuset interface files in a task's current cgroup.
This is especially valuable on large NUMA systems where placing jobs
on properly sized subsets of the systems with careful processor and
memory placement to reduce cross-node memory access and contention
can improve overall system performance.

The "cpuset" controller is hierarchical. That means the controller
cannot use CPUs or memory nodes not allowed in its parent.
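
As an illustration of the hierarchical rule, the following userspace
sketch models it as a set intersection. It assumes CPU lists are written
as comma-separated numbers or ranges (for example 0-4,6,8-10). The
function names parse_cpu_list and allowed_cpus are hypothetical. The
kernel's handling of empty or ungrantable lists is more involved and is
not modelled here.

```python
def parse_cpu_list(text):
    """Expand a list such as "0-4,6,8-10" into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string yields an empty set
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(requested, parent_allowed):
    """Model the hierarchical constraint: a child never exceeds its parent."""
    return requested & parent_allowed


parent = parse_cpu_list("0-7")
child = parse_cpu_list("0-4,6,8-10")
# CPUs 8-10 are requested by the child but not allowed in the parent.
print(sorted(allowed_cpus(child, parent)))
```

The same intersection applies to memory nodes, since both resources
follow the same parent-bounded rule.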


Cpuset Interface Files
~~~~~~~~~~~~~~~~~~~~~~

  cpuset.cpus
        A read-write multiple values file which exists on non-root
        cpuset-enabled cgroups.

        It lists the requested CPUs to be used by tasks within this
        cgroup. The actual list of CPUs to be granted, however, is
        subjected to constraints imposed by its parent and can differ
        from the requested CPUs.

        The CPU numbers are comma-separated numbers or ranges.
        For example::

          # cat cpuset.cpus
          0-4,6,8-10

        An empty value indicates that the cgroup is using the same
        setting as the nearest cgroup ancestor with a non-empty
        "cpuset.cpus" or all the available CPUs if none is found.

        The value of "cpuset.cpus" stays constant until the next update
        and won't be affected by any CPU hotplug events.

  cpuset.cpus.effective
        A read-only multiple values file which exists on all
        cpuset-enabled cgroups.

        It lists the onlined CPUs that are actually granted to this
        cgroup by its parent. These CPUs are allowed to be used by
        tasks within the current cgroup.

        If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
        all the CPUs from the parent cgroup that can be available to
        be used by this cgroup. Otherwise, it should be a subset of
        "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
        can be granted. In this case, it will be treated just like an
        empty "cpuset.cpus".

        Its value will be affected by CPU hotplug events.

  cpuset.mems
        A read-write multiple values file which exists on non-root
        cpuset-enabled cgroups.

        It lists the requested memory nodes to be used by tasks within
        this cgroup. The actual list of memory nodes granted, however,
        is subjected to constraints imposed by its parent and can differ
        from the requested memory nodes.

        The memory node numbers are comma-separated numbers or ranges.
        For example::

          # cat cpuset.mems
          0-1,3

        An empty value indicates that the cgroup is using the same
        setting as the nearest cgroup ancestor with a non-empty
        "cpuset.mems" or all the available memory nodes if none
        is found.

        The value of "cpuset.mems" stays constant until the next update
        and won't be affected by any memory nodes hotplug events.

  cpuset.mems.effective
        A read-only multiple values file which exists on all
        cpuset-enabled cgroups.

        It lists the onlined memory nodes that are actually granted to
        this cgroup by its parent. These memory nodes are allowed to
        be used by tasks within the current cgroup.

        If "cpuset.mems" is empty, it shows all the memory nodes from the
        parent cgroup that will be available to be used by this cgroup.
        Otherwise, it should be a subset of "cpuset.mems" unless none of
        the memory nodes listed in "cpuset.mems" can be granted. In this
        case, it will be treated just like an empty "cpuset.mems".

        Its value will be affected by memory nodes hotplug events.

  cpuset.cpus.partition
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups. This flag is owned by the parent cgroup
        and is not delegatable.

        It accepts only the following input values when written to.

          ========  ================================
          "root"    a partition root
          "member"  a non-root member of a partition
          ========  ================================

        When set to be a partition root, the current cgroup is the
        root of a new partition or scheduling domain that comprises
        itself and all its descendants except those that are separate
        partition roots themselves and their descendants. The root
        cgroup is always a partition root.

        There are constraints on where a partition root can be set.
        It can only be set in a cgroup if all the following conditions
        are true.

        1) The "cpuset.cpus" is not empty and the list of CPUs is
           exclusive, i.e. they are not shared by any of its siblings.
        2) The parent cgroup is a partition root.
        3) The "cpuset.cpus" is also a proper subset of the parent's
           "cpuset.cpus.effective".
        4) There are no child cgroups with cpuset enabled. This is for
           eliminating corner cases that have to be handled if such a
           condition is allowed.

        Setting it to partition root will take the CPUs away from the
        effective CPUs of the parent cgroup. Once it is set, this
        file cannot be reverted back to "member" if there are any child
        cgroups with cpuset enabled.

        A parent partition cannot distribute all its CPUs to its
        child partitions.
        There must be at least one cpu left in the
        parent partition.

        Once becoming a partition root, changes to "cpuset.cpus" are
        generally allowed as long as the first condition above is true,
        the change will not take away all the CPUs from the parent
        partition and the new "cpuset.cpus" value is a superset of its
        children's "cpuset.cpus" values.

        Sometimes, external factors like changes to ancestors'
        "cpuset.cpus" or cpu hotplug can cause the state of the partition
        root to change. On read, the "cpuset.cpus.partition" file
        can show the following values.

          ==============  ==============================
          "member"        Non-root member of a partition
          "root"          Partition root
          "root invalid"  Invalid partition root
          ==============  ==============================

        It is a partition root if the first 2 partition root conditions
        above are true and at least one CPU from "cpuset.cpus" is
        granted by the parent cgroup.

        A partition root can become invalid if none of CPUs requested
        in "cpuset.cpus" can be granted by the parent cgroup or the
        parent cgroup is no longer a partition root itself. In this
        case, it is not a real partition even though the restriction
        of the first partition root condition above will still apply.
        The cpu affinity of all the tasks in the cgroup will then be
        associated with CPUs in the nearest ancestor partition.

        An invalid partition root can be transitioned back to a
        real partition root if at least one of the requested CPUs
        can now be granted by its parent. In this case, the cpu
        affinity of all the tasks in the formerly invalid partition
        will be associated to the CPUs of the newly formed partition.
        Changing the partition state of an invalid partition root to
        "member" is always allowed even if child cpusets are present.


Device controller
-----------------

Device controller manages access to device files. It includes both
creation of new device files (using mknod), and access to the
existing device files.

Cgroup v2 device controller has no interface files and is implemented
on top of cgroup BPF. To control access to device files, a user may
create bpf programs of the BPF_CGROUP_DEVICE type and attach them
to cgroups. On an attempt to access a device file, corresponding
BPF programs will be executed, and depending on the return value
the attempt will succeed or fail with -EPERM.

A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
structure, which describes the device access attempt: access type
(mknod/read/write) and device (type, major and minor numbers).
If the program returns 0, the attempt fails with -EPERM, otherwise
it succeeds.

An example of BPF_CGROUP_DEVICE program may be found in the kernel
source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.


RDMA
----

The "rdma" controller regulates the distribution and accounting of
RDMA resources.

RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~

  rdma.max
        A read-write nested-keyed file that exists for all the cgroups
        except root that describes current configured resource limit
        for an RDMA/IB device.

        Lines are keyed by device name and are not ordered.
        Each line contains space separated resource name and its configured
        limit that can be distributed.

        The following nested keys are defined.

          ==========    =============================
          hca_handle    Maximum number of HCA Handles
          hca_object    Maximum number of HCA Objects
          ==========    =============================

        An example for mlx4 and ocrdma devices follows::

          mlx4_0 hca_handle=2 hca_object=2000
          ocrdma1 hca_handle=3 hca_object=max

  rdma.current
        A read-only file that describes current resource usage.
        It exists for all the cgroups except root.

        An example for mlx4 and ocrdma devices follows::

          mlx4_0 hca_handle=1 hca_object=20
          ocrdma1 hca_handle=1 hca_object=23

HugeTLB
-------

The HugeTLB controller allows limiting the HugeTLB usage per control group and
enforces the controller limit during page fault.

HugeTLB Interface Files
~~~~~~~~~~~~~~~~~~~~~~~

  hugetlb.<hugepagesize>.current
        Show current usage for "hugepagesize" hugetlb. It exists for all
        the cgroups except root.

  hugetlb.<hugepagesize>.max
        Set/show the hard limit of "hugepagesize" hugetlb usage.
        The default value is "max". It exists for all the cgroups except root.

  hugetlb.<hugepagesize>.events
        A read-only flat-keyed file which exists on non-root cgroups.

          max
                The number of allocation failures due to the HugeTLB limit.

  hugetlb.<hugepagesize>.events.local
        Similar to hugetlb.<hugepagesize>.events but the fields in the file
        are local to the cgroup i.e. not hierarchical. The file modified event
        generated on this file reflects only the local events.

Misc
----

perf_event
~~~~~~~~~~

perf_event controller, if not mounted on a legacy hierarchy, is
automatically enabled on the v2 hierarchy so that perf events can
always be filtered by cgroup v2 path. The controller can still be
moved to a legacy hierarchy after v2 hierarchy is populated.


Non-normative information
-------------------------

This section contains information that isn't considered to be a part of
the stable kernel API and so is subject to change.


CPU controller root cgroup process behaviour
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When distributing CPU cycles in the root cgroup each thread in this
cgroup is treated as if it was hosted in a separate child cgroup of the
root cgroup. The weight of this child cgroup depends on the nice level
of its thread.

For details of this mapping see sched_prio_to_weight array in
kernel/sched/core.c file (values from this array should be scaled
appropriately so the neutral - nice 0 - value is 100 instead of 1024).


IO controller root cgroup process behaviour
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Root cgroup processes are hosted in an implicit leaf child node.
When distributing IO resources this implicit child node is taken into
account as if it was a normal child cgroup of the root cgroup with a
weight value of 200.


Namespace
=========

Basics
------

cgroup namespace provides a mechanism to virtualize the view of the
"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
flag can be used with clone(2) and unshare(2) to create a new cgroup
namespace.
The process running inside the cgroup namespace will have
its "/proc/$PID/cgroup" output restricted to cgroupns root. The
cgroupns root is the cgroup of the process at the time of creation of
the cgroup namespace.

Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
complete path of the cgroup of a process. In a container setup where
a set of cgroups and namespaces are intended to isolate processes the
"/proc/$PID/cgroup" file may leak potential system level information
to the isolated processes. For example::

  # cat /proc/self/cgroup
  0::/batchjobs/container_id1

The path '/batchjobs/container_id1' can be considered as system-data
and undesirable to expose to the isolated processes. cgroup namespace
can be used to restrict visibility of this path.
For example, before creating a cgroup namespace, one would see::

  # ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  # cat /proc/self/cgroup
  0::/batchjobs/container_id1

After unsharing a new namespace, the view changes::

  # ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # cat /proc/self/cgroup
  0::/

When some thread from a multi-threaded process unshares its cgroup
namespace, the new cgroupns gets applied to the entire process (all
the threads). This is natural for the v2 hierarchy; however, for the
legacy hierarchies, this may be unexpected.

A cgroup namespace is alive as long as there are processes inside or
mounts pinning it. When the last usage goes away, the cgroup
namespace is destroyed. The cgroupns root and the actual cgroups
remain.


The Root and Views
------------------

The 'cgroupns root' for a cgroup namespace is the cgroup in which the
process calling unshare(2) is running.
For example, if a process in
/batchjobs/container_id1 cgroup calls unshare, cgroup
/batchjobs/container_id1 becomes the cgroupns root. For the
init_cgroup_ns, this is the real root ('/') cgroup.

The cgroupns root cgroup does not change even if the namespace creator
process later moves to a different cgroup::

  # ~/unshare -c # unshare cgroupns in some cgroup
  # cat /proc/self/cgroup
  0::/
  # mkdir sub_cgrp_1
  # echo 0 > sub_cgrp_1/cgroup.procs
  # cat /proc/self/cgroup
  0::/sub_cgrp_1

Each process gets its namespace-specific view of "/proc/$PID/cgroup".

Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
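
The namespace-specific view can be thought of as a path computation
relative to the reader's cgroupns root. The sketch below is a userspace
model only: the function name is hypothetical and the kernel does not
use this code. It assumes that a cgroup lying outside the reader's root
is reported with leading ``..`` components.

```python
import posixpath


def cgroup_path_as_seen_from(real_path, ns_root):
    """Model the path shown in /proc/$PID/cgroup to a reader whose
    cgroupns root is ns_root, for a process whose cgroup is real_path.

    Both arguments are absolute paths in the initial cgroup namespace.
    """
    rel = posixpath.relpath(real_path, start=ns_root)
    # The cgroupns root itself is shown as "/"; every other path keeps a
    # leading "/" to mark it as relative to the reader's root.
    return "/" if rel == "." else "/" + rel


real = "/batchjobs/container_id1/sub_cgrp_1"

# Reader inside the namespace rooted at /batchjobs/container_id1:
print(cgroup_path_as_seen_from(real, "/batchjobs/container_id1"))  # /sub_cgrp_1

# Reader in the initial namespace, whose root is the real root:
print(cgroup_path_as_seen_from(real, "/"))  # /batchjobs/container_id1/sub_cgrp_1
```

The model is purely lexical. It does not check that either cgroup
exists and it ignores legacy hierarchies.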

From within an unshared cgroupns::

  # sleep 100000 &
  [1] 7353
  # echo 7353 > sub_cgrp_1/cgroup.procs
  # cat /proc/7353/cgroup
  0::/sub_cgrp_1

From the initial cgroup namespace, the real cgroup path will be
visible::

  $ cat /proc/7353/cgroup
  0::/batchjobs/container_id1/sub_cgrp_1

From a sibling cgroup namespace (that is, a namespace rooted at a
different cgroup), the cgroup path relative to its own cgroup
namespace root will be shown. For instance, if PID 7353's cgroup
namespace root is at '/batchjobs/container_id2', then it will see::

  # cat /proc/7353/cgroup
  0::/../container_id2/sub_cgrp_1

Note that the relative path always starts with '/' to indicate that
it is relative to the cgroup namespace root of the caller.


Migration and setns(2)
----------------------

Processes inside a cgroup namespace can move into and out of the
namespace root if they have proper access to external cgroups.
For example, from inside a namespace with cgroupns root at
/batchjobs/container_id1, and assuming that the global hierarchy is
still accessible inside cgroupns::

  # cat /proc/7353/cgroup
  0::/sub_cgrp_1
  # echo 7353 > batchjobs/container_id2/cgroup.procs
  # cat /proc/7353/cgroup
  0::/../container_id2

Note that this kind of setup is not encouraged. A task inside cgroup
namespace should only be exposed to its own cgroupns hierarchy.

setns(2) to another cgroup namespace is allowed when:

(a) the process has CAP_SYS_ADMIN against its current user namespace
(b) the process has CAP_SYS_ADMIN against the target cgroup
    namespace's userns

No implicit cgroup changes happen with attaching to another cgroup
namespace. It is expected that someone moves the attaching
process under the target cgroup namespace root.
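
Because attaching to a namespace does not move the process, a manager
may want to check whether a process still sits outside the cgroupns
root. The following sketch parses "/proc/$PID/cgroup" content and flags
that case. The helper names are hypothetical. The
``hierarchy-ID:controller-list:path`` line format, with ``0::`` marking
the v2 entry, is assumed from the examples in this document, as is the
leading ``/..`` for paths outside the root.

```python
def parse_proc_cgroup(text):
    """Split /proc/$PID/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Only split on the first two colons so a ':' in the path survives.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def v2_path(text):
    """Return the cgroup v2 path (the "0::" entry), or None if absent."""
    for hierarchy_id, controllers, path in parse_proc_cgroup(text):
        if hierarchy_id == 0 and not controllers:
            return path
    return None


def is_outside_ns_root(path):
    """True if the path points outside the reader's cgroupns root."""
    return path == "/.." or path.startswith("/../")


sample = "0::/../container_id2\n"
print(v2_path(sample))                      # /../container_id2
print(is_outside_ns_root(v2_path(sample)))  # True
```

A process for which ``is_outside_ns_root()`` returns True is visible to
the reader but lives in a cgroup that the reader's namespace does not
contain, which is the situation the note above discourages.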


Interaction with Other Namespaces
---------------------------------

Namespace specific cgroup hierarchy can be mounted by a process
running inside a non-init cgroup namespace::

  # mount -t cgroup2 none $MOUNT_POINT

This will mount the unified cgroup hierarchy with cgroupns root as the
filesystem root. The process needs CAP_SYS_ADMIN against its user and
mount namespaces.

The virtualization of /proc/self/cgroup file combined with restricting
the view of cgroup hierarchy by namespace-private cgroupfs mount
provides a properly isolated cgroup view inside the container.


Information on Kernel Programming
=================================

This section contains kernel programming information in the areas
where interacting with cgroup is necessary. cgroup core and
controllers are not covered.


Filesystem Support for Writeback
--------------------------------

A filesystem can support cgroup writeback by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.

  wbc_init_bio(@wbc, @bio)
        Should be called for each bio carrying writeback data and
        associates the bio with the inode's owner cgroup and the
        corresponding request queue. This must be called after
        a queue (device) has been associated with the bio and
        before submission.

  wbc_account_cgroup_owner(@wbc, @page, @bytes)
        Should be called for each data segment being written out.
        While this function doesn't care exactly when it's called
        during the writeback session, it's the easiest and most
        natural to call it as data segments are added to a bio.

With writeback bio's annotated, cgroup support can be enabled per
super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.

wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.


Deprecated v1 Core Features
===========================

- Multiple hierarchies including named ones are not supported.

- None of the v1 mount options are supported.

- The "tasks" file is removed and "cgroup.procs" is not sorted.

- "cgroup.clone_children" is removed.

- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
  at the root instead.


Issues with v1 and Rationales for v2
====================================

Multiple Hierarchies
--------------------

cgroup v1 allowed an arbitrary number of hierarchies and each
hierarchy could host any number of controllers. While this seemed to
provide a high level of flexibility, it wasn't useful in practice.

For example, as there is only one instance of each controller, utility
type controllers such as freezer which can be useful in all
hierarchies could only be used in one. The issue is exacerbated by
the fact that controllers couldn't be moved to another hierarchy once
hierarchies were populated.
Another issue was that all controllers
bound to a hierarchy were forced to have exactly the same view of the
hierarchy. It wasn't possible to vary the granularity depending on
the specific controller.

In practice, these issues heavily limited which controllers could be
put on the same hierarchy and most configurations resorted to putting
each controller on its own hierarchy. Only closely related ones, such
as the cpu and cpuacct controllers, made sense to be put on the same
hierarchy. This often meant that userland ended up managing multiple
similar hierarchies repeating the same steps on each hierarchy
whenever a hierarchy management operation was necessary.

Furthermore, support for multiple hierarchies came at a steep cost.
It greatly complicated cgroup core implementation but more importantly
the support for multiple hierarchies restricted how cgroup could be
used in general and what controllers were able to do.

There was no limit on how many hierarchies there might be, which meant
that a thread's cgroup membership couldn't be described in finite
length.
The key might contain any number of entries and was unlimited
in length, which made it highly awkward to manipulate and led to
addition of controllers which existed only to identify membership,
which in turn exacerbated the original problem of proliferating number
of hierarchies.

Also, as a controller couldn't have any expectation regarding the
topologies of hierarchies other controllers might be on, each
controller had to assume that all other controllers were attached to
completely orthogonal hierarchies. This made it impossible, or at
least very cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary. What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller. In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers. For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.


Thread Granularity
------------------

cgroup v1 allowed threads of a process to belong to different cgroups.
This didn't make sense for some controllers and those controllers
ended up implementing different ways to ignore such situations but
much more importantly it blurred the line between API exposed to
individual applications and system management interface.

Generally, in-process knowledge is available only to the process
itself; thus, unlike service-level organization of processes,
categorizing threads of a process requires active participation from
the application which owns the target process.

cgroup v1 had an ambiguously defined delegation model which got abused
in combination with thread granularity. cgroups were delegated to
individual applications so that they can create and manage their own
sub-hierarchies and control resource distributions along them. This
effectively raised cgroup to the status of a syscall-like API exposed
to lay programs.

First of all, cgroup has a fundamentally inadequate interface to be
exposed this way. For a process to access its own knobs, it has to
extract the path on the target hierarchy from /proc/self/cgroup,
construct the path by appending the name of the knob to the path, open
and then read and/or write to it. This is not only extremely clunky
and unusual but also inherently racy.
There is no conventional way to
define transaction across the required steps and nothing can guarantee
that the process would actually be operating on its own sub-hierarchy.

cgroup controllers implemented a number of knobs which would never be
accepted as public APIs because they were just adding control knobs to
system-management pseudo filesystem. cgroup ended up with interface
knobs which were not properly abstracted or refined and directly
revealed kernel internal details. These knobs got exposed to
individual applications through the ill-defined delegation mechanism
effectively abusing cgroup as a shortcut to implementing public APIs
without going through the required scrutiny.

This was painful for both userland and kernel. Userland ended up with
misbehaving and poorly abstracted interfaces and the kernel ended up
exposing, and being locked into, constructs inadvertently.


Competition Between Inner Nodes and Threads
-------------------------------------------

cgroup v1 allowed threads to be in any cgroups which created an
interesting problem where threads belonging to a parent cgroup and its
children cgroups competed for resources. This was nasty as two
different types of entities competed and there was no obvious way to
settle it. Different controllers did different things.

The cpu controller considered threads and cgroups as equivalents and
mapped nice levels to cgroup weights. This worked for some cases but
fell flat when children wanted to be allocated specific ratios of CPU
cycles and the number of internal threads fluctuated - the ratios
constantly changed as the number of competing entities fluctuated.
There also were other issues. The mapping from nice level to weight
wasn't obvious or universal, and there were various other knobs which
simply weren't available for threads.

The io controller implicitly created a hidden leaf node for each
cgroup to host the threads. The hidden leaf had its own copies of all
the knobs with ``leaf_`` prefixed. While this allowed equivalent
control over internal threads, it came with serious drawbacks. It
always added an extra layer of nesting which wouldn't be necessary
otherwise, made the interface messy and significantly complicated the
implementation.

The memory controller didn't have a way to control what happened
between internal tasks and child cgroups and the behavior was not
clearly defined. There were attempts to add ad-hoc behaviors and
knobs to tailor the behavior to specific workloads which would have
led to problems extremely difficult to resolve in the long term.
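
The drifting ratios described for the cpu controller above can be made
concrete with a little arithmetic. The sketch below is a simplified
model, not scheduler code. It assumes a single child cgroup competing
with the parent's own threads, every thread at nice 0 and therefore
mapped to the same weight as a default cgroup (100 on the scale used in
this document), and a fully contended CPU. The function name is
hypothetical.

```python
def child_cpu_share(child_weight, n_internal_threads, thread_weight=100):
    """Fraction of CPU a child cgroup receives when it competes with
    n_internal_threads threads living directly in its parent cgroup."""
    total = child_weight + n_internal_threads * thread_weight
    return child_weight / total


# The child's weight never changes, yet its share does:
for n in (0, 1, 3):
    print(n, child_cpu_share(100, n))
# 0 1.0
# 1 0.5
# 3 0.25
```

The child's share is a function of how many threads the parent happens
to run, which is why a fixed ratio could not be guaranteed. The no
internal process constraint of cgroup v2 removes this competition for
non-root cgroups.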

Multiple controllers struggled with internal tasks and came up with
different ways to deal with it; unfortunately, all the approaches were
severely flawed and, furthermore, the widely different behaviors
made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core
in a uniform way.


Other Interface Issues
----------------------

cgroup v1 grew without oversight and developed a large number of
idiosyncrasies and inconsistencies. One issue on the cgroup core side
was how an empty cgroup was notified - a userland helper binary was
forked and executed for each event. The event delivery wasn't
recursive or delegatable. The limitations of the mechanism also led
to in-kernel event delivery filtering mechanism further complicating
the interface.

Controller interfaces were problematic too. An extreme example is
controllers completely ignoring hierarchical organization and treating
all cgroups as if they were all located directly under the root
cgroup. Some controllers exposed a large amount of inconsistent
implementation details to userland.

There also was no consistency across controllers.
When a new cgroup
was created, some controllers defaulted to not imposing extra
restrictions while others disallowed any resource usage until
explicitly configured. Configuration knobs for the same type of
control used widely differing naming schemes and formats. Statistics
and information knobs were named arbitrarily and used different
formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates
controllers so that they expose minimal and consistent interfaces.


Controller Issues and Remedies
------------------------------

Memory
~~~~~~

The original lower boundary, the soft limit, is defined as a limit
that is unset by default. As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out. The costs for
optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide the
basic desirable behavior. First off, the soft limit has no
hierarchical meaning. All configured groups are organized in a global
rbtree and treated like equal peers, regardless of where they are
located in the hierarchy. This makes subtree delegation impossible.
Second, the soft limit reclaim pass is so aggressive that it not only
introduces high allocation latencies into the system, but also impacts
system performance due to overreclaim, to the point where the feature
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
reserve. A cgroup enjoys reclaim protection when it's within its
effective low, which makes delegation of subtrees possible. It also
enjoys having reclaim pressure proportional to its overage when
above its effective low.

The original high boundary, the hard limit, is defined as a strict
limit that cannot budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of the
available memory. The memory consumption of workloads varies during
runtime, and that requires users to overcommit. But doing that with a
strict upper limit requires either a fairly accurate prediction of the
working set size or adding slack to the limit. Since working set size
estimation is hard and error prone, and getting it wrong results in
OOM kills, most users tend to err on the side of a looser limit and
end up wasting precious resources.

The memory.high boundary on the other hand can be set much more
conservatively.
When hit, it throttles allocations by forcing them
into direct reclaim to work off the excess, but it never invokes the
OOM killer. As a result, a high boundary that is chosen too
aggressively will not terminate the processes, but instead it will
lead to gradual performance degradation. The user can monitor this
and make corrections until the minimal memory footprint that still
gives acceptable performance is found.

In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary can
be exceeded. But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of the
system than killing the group. Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.

Setting the original memory.limit_in_bytes below the current usage was
subject to a race condition, where concurrent charges could cause the
limit setting to fail. memory.max on the other hand will first set the
limit to prevent new charges, and then reclaim and OOM kill until the
new limit is met - or the task writing to memory.max is killed.

The combined memory+swap accounting and limiting is replaced by real
control over swap space.


The main argument for a combined memory+swap facility in the original
cgroup design was that global or parental pressure would always be
able to swap all anonymous memory of a child group, regardless of the
child's own (possibly untrusted) configuration. However, untrusted
groups can sabotage swapping by other means - such as referencing their
anonymous memory in a tight loop - and an admin cannot assume full
swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an
intuitive userspace interface, and it flies in the face of the idea
that cgroup controllers should account and limit specific physical
resources. Swap space is a resource like all others in the system,
and that's why unified hierarchy allows distributing it separately.
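
As a rough illustration of how the v2 boundaries discussed above fit
together, the sketch below writes memory.low, memory.high, memory.max
and memory.swap.max for a single cgroup and reads back the "high"
counter from memory.events, which can serve as the monitoring signal
when tuning memory.high. Only the interface file names come from this
document; the helper names, the directory argument and any sizes
passed in are hypothetical, and this is not a kernel-provided API.

```python
import os


def write_memory_knobs(cgroup_dir, low, high, limit, swap_limit):
    """Write the v2 memory boundaries for one cgroup.

    Each value is a byte count or the string "max".
    Hypothetical helper for illustration only.
    """
    knobs = {
        "memory.low": low,              # reclaim protection
        "memory.high": high,            # throttling boundary, no OOM kill
        "memory.max": limit,            # hard limit, may OOM kill
        "memory.swap.max": swap_limit,  # swap is limited separately
    }
    for name, value in knobs.items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(f"{value}\n")


def read_high_events(cgroup_dir):
    """Return the "high" counter from the flat-keyed memory.events file."""
    with open(os.path.join(cgroup_dir, "memory.events")) as f:
        for line in f:
            key, _, value = line.partition(" ")
            if key == "high":
                return int(value)
    return 0
```

On a real system, cgroup_dir would be a cgroup directory under the
cgroup2 mount point and the writes typically require appropriate
privileges or delegation. A steadily growing "high" count suggests
that memory.high may be set below what the workload needs for
acceptable performance.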