.. _cgroup-v2:

================
Control Group v2
================

:Date: October, 2015
:Author: Tejun Heo <tj@kernel.org>

This is the authoritative documentation on the design, interface and
conventions of cgroup v2.  It describes all userland-visible aspects
of cgroup including core and specific controller behaviors.  All
future changes must be reflected in this document.  Documentation for
v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.

.. CONTENTS

   1. Introduction
     1-1. Terminology
     1-2. What is cgroup?
   2. Basic Operations
     2-1. Mounting
     2-2. Organizing Processes and Threads
       2-2-1. Processes
       2-2-2. Threads
     2-3. [Un]populated Notification
     2-4. Controlling Controllers
       2-4-1. Enabling and Disabling
       2-4-2. Top-down Constraint
       2-4-3. No Internal Process Constraint
     2-5. Delegation
       2-5-1. Model of Delegation
       2-5-2. Delegation Containment
     2-6. Guidelines
       2-6-1. Organize Once and Control
       2-6-2. Avoid Name Collisions
   3. Resource Distribution Models
     3-1. Weights
     3-2. Limits
     3-3. Protections
     3-4. Allocations
   4. Interface Files
     4-1. Format
     4-2. Conventions
     4-3. Core Interface Files
   5. Controllers
     5-1. CPU
       5-1-1. CPU Interface Files
     5-2. Memory
       5-2-1. Memory Interface Files
       5-2-2. Usage Guidelines
       5-2-3. Memory Ownership
     5-3. IO
       5-3-1. IO Interface Files
       5-3-2. Writeback
       5-3-3. IO Latency
         5-3-3-1. How IO Latency Throttling Works
         5-3-3-2. IO Latency Interface Files
       5-3-4. IO Priority
     5-4. PID
       5-4-1. PID Interface Files
     5-5. Cpuset
       5-5-1. Cpuset Interface Files
     5-6. Device
     5-7. RDMA
       5-7-1. RDMA Interface Files
     5-8. HugeTLB
       5-8-1. HugeTLB Interface Files
     5-9. Misc
       5-9-1. Miscellaneous cgroup Interface Files
       5-9-2. Migration and Ownership
     5-10. Others
       5-10-1. perf_event
     5-N. Non-normative information
       5-N-1. CPU controller root cgroup process behaviour
       5-N-2. IO controller root cgroup process behaviour
   6. Namespace
     6-1. Basics
     6-2. The Root and Views
     6-3. Migration and setns(2)
     6-4. Interaction with Other Namespaces
   P. Information on Kernel Programming
     P-1. Filesystem Support for Writeback
   D. Deprecated v1 Core Features
   R. Issues with v1 and Rationales for v2
     R-1. Multiple Hierarchies
     R-2. Thread Granularity
     R-3. Competition Between Inner Nodes and Threads
     R-4. Other Interface Issues
     R-5. Controller Issues and Remedies
       R-5-1. Memory


Introduction
============

Terminology
-----------

"cgroup" stands for "control group" and is never capitalized.  The
singular form is used to designate the whole feature and also as a
qualifier as in "cgroup controllers".  When explicitly referring to
multiple individual control groups, the plural form "cgroups" is used.


What is cgroup?
---------------

cgroup is a mechanism to organize processes hierarchically and
distribute system resources along the hierarchy in a controlled and
configurable manner.

cgroup is largely composed of two parts - the core and controllers.
cgroup core is primarily responsible for hierarchically organizing
processes.  A cgroup controller is usually responsible for
distributing a specific type of system resource along the hierarchy
although there are utility controllers which serve purposes other than
resource distribution.

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup.  All threads of a process belong to the
same cgroup.  On creation, all processes are put in the cgroup that
the parent process belongs to at the time.  A process can be migrated
to another cgroup.  Migration of a process doesn't affect already
existing descendant processes.

Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup.  All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups comprising the inclusive
sub-hierarchy of the cgroup.  When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further.  The
restrictions set closer to the root in the hierarchy cannot be
overridden from further away.


Basic Operations
================

Mounting
--------

Unlike v1, cgroup v2 has only a single hierarchy.  The cgroup v2
hierarchy can be mounted with the following mount command::

  # mount -t cgroup2 none $MOUNT_POINT

The cgroup2 filesystem has the magic number 0x63677270 ("cgrp").  All
controllers which support v2 and are not bound to a v1 hierarchy are
automatically bound to the v2 hierarchy and show up at the root.
Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies.  This allows mixing the v2 hierarchy with
the legacy v1 multiple hierarchies in a fully backward compatible way.
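As a rough illustration of locating a mounted v2 hierarchy from
userland, the sketch below scans text in the /proc/self/mounts format
for entries whose filesystem type is cgroup2.  The whitespace-separated
field layout, the function name and the sample mount point
/sys/fs/cgroup are assumptions made for the example; they are not
defined by this document.

```python
def find_cgroup2_mounts(mounts_text):
    """Return (mount_point, options) pairs for every cgroup2 entry.

    mounts_text is assumed to hold lines of the form
    "device mount_point fstype options dump pass".
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(fields) < 4:
            continue
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype == "cgroup2":
            found.append((mount_point, options.split(",")))
    return found


# Hypothetical sample; on a live system read /proc/self/mounts instead.
sample = (
    "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
    "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0\n"
)
print(find_cgroup2_mounts(sample))
```

Escaped characters in mount points (such as \040 for a space) are left
undecoded in this sketch.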

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy.  Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the v2 hierarchy after the final umount of the previous hierarchy.
Similarly, a controller should be fully disabled to be moved out of
the unified hierarchy and it may take some time for the disabled
controller to become available for other hierarchies; furthermore, due
to inter-controller dependencies, other controllers may need to be
disabled too.

While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use.  It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers after system boot.

During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible.  To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and makes them always available in v2.

cgroup v2 currently supports the following mount options.

  nsdelegate
        Consider cgroup namespaces as delegation boundaries.  This
        option is system wide and can only be set on mount or modified
        through remount from the init namespace.  The mount option is
        ignored on non-init namespace mounts.  Please refer to the
        Delegation section for details.

  favordynmods
        Reduce the latencies of dynamic cgroup modifications such as
        task migrations and controller on/offs at the cost of making
        hot path operations such as forks and exits more expensive.
        The static usage pattern of creating a cgroup, enabling
        controllers, and then seeding it with CLONE_INTO_CGROUP is
        not affected by this option.

  memory_localevents
        Only populate memory.events with data for the current cgroup,
        and not any subtrees.  This is legacy behaviour; the default
        behaviour without this option is to include subtree counts.
        This option is system wide and can only be set on mount or
        modified through remount from the init namespace.  The mount
        option is ignored on non-init namespace mounts.

  memory_recursiveprot
        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups.  This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees.  This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).


Organizing Processes and Threads
--------------------------------

Processes
~~~~~~~~~

Initially, only the root cgroup exists to which all processes belong.
A child cgroup can be created by creating a sub-directory::

  # mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree
structure.  Each cgroup has a read-writable interface file
"cgroup.procs".  When read, it lists the PIDs of all processes which
belong to the cgroup one-per-line.  The PIDs are not ordered and the
same PID may show up more than once if the process got moved to
another cgroup and then back or the PID got recycled while reading.
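Since the listing is unordered and may repeat a PID, a reader that
wants a stable view has to normalize it.  A minimal sketch, assuming a
hypothetical helper name and that the caller passes the cgroup's
directory path:

```python
import os


def read_pids(cgroup_dir, filename="cgroup.procs"):
    """Return the sorted, de-duplicated PIDs listed in a cgroup file."""
    with open(os.path.join(cgroup_dir, filename)) as f:
        # One PID per line; a set drops repeats caused by a process
        # moving away and back or by PID recycling during the read.
        return sorted({int(line) for line in f if line.strip()})
```

The result is still only a snapshot; processes may fork, exit or
migrate right after the file is read.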

A process can be migrated into a cgroup by writing its PID to the
target cgroup's "cgroup.procs" file.  Only one process can be migrated
on a single write(2) call.  If a process is composed of multiple
threads, writing the PID of any thread migrates all threads of the
process.
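The one-process-per-write rule means a tool moving several processes
has to issue one write per PID.  The following is a sketch only; the
function name is hypothetical, and opening the file unbuffered is one
way (an assumption of this example) to make sure each PID is sent to
the kernel as its own write(2) call.

```python
import os


def migrate_processes(cgroup_dir, pids):
    """Move each PID into cgroup_dir, one write(2) call per process."""
    procs_path = os.path.join(cgroup_dir, "cgroup.procs")
    moved = []
    for pid in pids:
        # Unbuffered binary mode: the write below is passed straight
        # through, so a failure is raised as OSError for this PID.
        with open(procs_path, "wb", buffering=0) as f:
            f.write(f"{pid}\n".encode())
        moved.append(pid)
    return moved
```

If a write fails, the exception propagates and the PIDs after the
failing one are left where they were.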

When a process forks a child process, the new process is born into the
cgroup that the forking process belongs to at the time of the
operation.  After exit, a process stays associated with the cgroup
that it belonged to at the time of exit until it's reaped; however, a
zombie process does not appear in "cgroup.procs" and thus can't be
moved to another cgroup.

A cgroup which doesn't have any children or live processes can be
destroyed by removing the directory.  Note that a cgroup which doesn't
have any children and is associated only with zombie processes is
considered empty and can be removed::

  # rmdir $CGROUP_NAME

"/proc/$PID/cgroup" lists a process's cgroup membership.  If legacy
cgroup is in use in the system, this file may contain multiple lines,
one for each hierarchy.  The entry for cgroup v2 is always in the
format "0::$PATH"::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested

If the process becomes a zombie and the cgroup it was associated with
is removed subsequently, " (deleted)" is appended to the path::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested (deleted)

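The "0::$PATH" format described above can be picked out of the file
with a few lines of parsing.  This is a sketch under stated
assumptions: the function name is hypothetical, and so is the v1-style
line "5:memory:/foo" used in the usage example.

```python
def parse_v2_entry(proc_cgroup_text):
    """Return (path, deleted) for the cgroup v2 line, or None if absent."""
    suffix = " (deleted)"
    for line in proc_cgroup_text.splitlines():
        # The v2 entry is the line in the "0::$PATH" format.
        if line.startswith("0::"):
            path = line[len("0::"):]
            if path.endswith(suffix):
                return path[: -len(suffix)], True
            return path, False
    return None


print(parse_v2_entry("5:memory:/foo\n0::/test-cgroup/test-cgroup-nested\n"))
```

A cgroup whose name itself ends in " (deleted)" cannot be told apart
from a removed one by this check.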

Threads
~~~~~~~

cgroup v2 supports thread granularity for a subset of controllers to
support use cases requiring hierarchical resource distribution across
the threads of a group of processes.  By default, all threads of a
process belong to the same cgroup, which also serves as the resource
domain to host resource consumptions which are not specific to a
process or thread.  The thread mode allows threads to be spread across
a subtree while still maintaining the common resource domain for them.

Controllers which support thread mode are called threaded controllers.
The ones which don't are called domain controllers.

Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup.  The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy.  The root
of a threaded subtree, that is, the nearest ancestor which is not
threaded, is called the threaded domain or thread root interchangeably
and serves as the resource domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in
different cgroups and are not subject to the no internal process
constraint - threaded controllers can be enabled on non-leaf cgroups
whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have populated child cgroups which aren't threaded.  Because the
root cgroup is not subject to the no internal process constraint, it
can serve both as a threaded domain and a parent to domain cgroups.

The current operation mode or type of the cgroup is shown in the
"cgroup.type" file which indicates whether the cgroup is a normal
domain, a domain which is serving as the domain of a threaded subtree,
or a threaded cgroup.

On creation, a cgroup is always a domain cgroup and can be made
threaded by writing "threaded" to the "cgroup.type" file.  The
operation is one-way::

  # echo threaded > cgroup.type

Once threaded, the cgroup can't be made a domain again.  To enable the
thread mode, the following conditions must be met.

- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.

- When the parent is an unthreaded domain, it must not have any domain
  controllers enabled or populated domain children.  The root is
  exempt from this requirement.

Topology-wise, a cgroup can be in an invalid state.  Please consider
the following topology::

  A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can
host child domains.  C can't be used until it is turned into a
threaded cgroup.  The "cgroup.type" file will report "domain (invalid)"
in these cases.  Operations which fail due to invalid topology use
EOPNOTSUPP as the errno.

A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.
A threaded domain reverts to a normal domain when the conditions
clear.

When read, "cgroup.threads" contains the list of the thread IDs of all
threads in the cgroup.  Except that the operations are per-thread
instead of per-process, "cgroup.threads" has the same format and
behaves the same way as "cgroup.procs".  While "cgroup.threads" can be
written to in any cgroup, as it can only move threads inside the same
threaded domain, its operations are confined inside each threaded
subtree.

The threaded domain cgroup serves as the resource domain for the whole
subtree, and, while the threads can be scattered across the subtree,
all the processes are considered to be in the threaded domain cgroup.
"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
processes in the subtree and is not readable in the subtree proper.
However, "cgroup.procs" can be written to from anywhere in the subtree
to migrate all threads of the matching process to the cgroup.

Only threaded controllers can be enabled in a threaded subtree.  When
a threaded controller is enabled inside a threaded subtree, it only
accounts for and controls resource consumptions associated with the
threads in the cgroup and its descendants.  All consumptions which
aren't tied to a specific thread belong to the threaded domain cgroup.

Because a threaded subtree is exempt from the no internal process
constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups.  Each
threaded controller defines how such competitions are handled.


[Un]populated Notification
--------------------------

Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it.  Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1.  poll and [id]notify
events are triggered when the value changes.  This can be used, for
example, to start a clean-up operation after all processes of a given
sub-hierarchy have exited.  The populated state updates and
notifications are recursive.  Consider the following sub-hierarchy
where the numbers in the parentheses represent the numbers of processes
in each cgroup::

  A(4) - B(0) - C(1)
              \ D(0)

A, B and C's "populated" fields would be 1 while D's would be 0.  After
the one process in C exits, B and C's "populated" fields would flip to
"0" and file modified events will be generated on the "cgroup.events"
files of both cgroups.

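The recursive rule can be checked against the example hierarchy with a
small userland model.  This is only an illustration of the stated
semantics, not of how the kernel tracks the state; the function name
and the dictionary-based tree representation are assumptions of the
sketch.

```python
def populated_states(children, live_counts):
    """Map each cgroup name to its "populated" value (0 or 1).

    children maps a cgroup to the list of its child cgroups;
    live_counts maps a cgroup to its number of live processes.
    """
    states = {}

    def visit(name):
        # Build the list first so every child is visited and recorded.
        child_states = [visit(child) for child in children.get(name, [])]
        # Populated if the cgroup or any descendant has a live process.
        states[name] = int(live_counts.get(name, 0) > 0 or any(child_states))
        return states[name]

    all_children = {c for kids in children.values() for c in kids}
    for root in set(children) - all_children:
        visit(root)
    return states


tree = {"A": ["B"], "B": ["C", "D"]}
before = populated_states(tree, {"A": 4, "B": 0, "C": 1, "D": 0})
after = populated_states(tree, {"A": 4, "B": 0, "C": 0, "D": 0})
# Cgroups whose value changed are the ones that would see an event.
flipped = sorted(name for name in before if before[name] != after[name])
print(flipped)
```

In this model A stays populated because of its own four processes, so
only B and C change state when the process in C exits.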

Controlling Controllers
-----------------------

Enabling and Disabling
~~~~~~~~~~~~~~~~~~~~~~

Each cgroup has a "cgroup.controllers" file which lists all
controllers available for the cgroup to enable::

  # cat cgroup.controllers
  cpu io memory

No controller is enabled by default.  Controllers can be enabled and
disabled by writing to the "cgroup.subtree_control" file::

  # echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be
enabled.  When multiple operations are specified as above, they either
all succeed or all fail.  If multiple operations on the same controller
are specified, the last one is effective.
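The three rules above (only listed controllers can be enabled, all or
nothing, last operation wins) can be modelled in userland as follows.
This is a simplified sketch, not the kernel's implementation: the
function name and the use of ValueError are assumptions, and the
structural constraints described in the following sections are not
modelled.

```python
def apply_subtree_control(available, enabled, request):
    """Return the new enabled set after a "+cpu -io" style request.

    available: controllers listed in "cgroup.controllers".
    enabled: controllers currently in "cgroup.subtree_control".
    Raises ValueError and changes nothing if any operation is invalid.
    """
    final_ops = {}
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError(f"malformed operation: {token!r}")
        # A later operation on the same controller replaces an earlier one.
        final_ops[token[1:]] = token[0]

    # Validate everything before building the result: all or nothing.
    for name, op in final_ops.items():
        if op == "+" and name not in available:
            raise ValueError(f"controller not available: {name}")

    result = set(enabled)
    for name, op in final_ops.items():
        if op == "+":
            result.add(name)
        else:
            result.discard(name)
    return result


print(sorted(apply_subtree_control({"cpu", "io", "memory"}, {"io"},
                                   "+cpu +memory -io")))
```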

Enabling a controller in a cgroup indicates that the distribution of
the target resource across its immediate children will be controlled.
Consider the following sub-hierarchy.  The enabled controllers are
listed in parentheses::

  A(cpu,memory) - B(memory) - C()
                            \ D()

As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B.  As B has
"memory" enabled but not "cpu", C and D will compete freely on CPU
cycles but their division of memory available to B will be controlled.

As a controller regulates the distribution of the target resource to
the cgroup's children, enabling it creates the controller's interface
files in the child cgroups.  In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D.  Likewise, disabling "memory" from B would remove the "memory."
prefixed controller interface files from C and D.  This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.

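Put differently, the set of controller-prefixed files found in a cgroup
is determined by its parent's "cgroup.subtree_control".  The model
below replays the A/B/C/D example; the function name and the
dictionary layout are assumptions made for illustration.

```python
def controllers_with_files(parents, subtree_control):
    """Map each non-root cgroup to the controllers whose files it contains.

    parents maps a cgroup to its parent; subtree_control maps a cgroup
    to the set of controllers enabled in its "cgroup.subtree_control".
    """
    # A cgroup has "<controller>." files for whatever its parent enabled.
    return {
        child: set(subtree_control.get(parent, set()))
        for child, parent in parents.items()
    }


parents = {"B": "A", "C": "B", "D": "B"}
enabled = {"A": {"cpu", "memory"}, "B": {"memory"}}
print(controllers_with_files(parents, enabled))
```

Here B carries both "cpu." and "memory." files because of A, while C
and D only carry "memory." files until "cpu" is also enabled on B.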

Top-down Constraint
~~~~~~~~~~~~~~~~~~~

Resources are distributed top-down and a cgroup can further distribute
a resource only if the resource has been distributed to it from the
parent.  This means that all non-root "cgroup.subtree_control" files
can only contain controllers which are enabled in the parent's
"cgroup.subtree_control" file.  A controller can be enabled only if
the parent has the controller enabled and a controller can't be
disabled if one or more children have it enabled.

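The two checks implied by this constraint can be written down
directly.  The helper names are hypothetical and the sketch applies to
non-root cgroups only, since the rule above is stated for non-root
"cgroup.subtree_control" files.

```python
def can_enable(controller, parent_subtree_control):
    """A controller can be enabled only if the parent has it enabled."""
    return controller in parent_subtree_control


def can_disable(controller, children_subtree_controls):
    """A controller can't be disabled while any child has it enabled."""
    return all(controller not in child for child in children_subtree_controls)


# B sits below A(cpu,memory); B's only child has "memory" enabled.
print(can_enable("io", {"cpu", "memory"}), can_disable("memory", [{"memory"}]))
```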

No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Non-root cgroups can distribute domain resources to their children
only when they don't have any processes of their own.  In other words,
only domain cgroups which don't contain any processes can have domain
controllers enabled in their "cgroup.subtree_control" files.

This guarantees that, when a domain controller is looking at the part
of the hierarchy which has it enabled, processes are always only on
the leaves.  This rules out situations where child cgroups compete
against internal processes of the parent.

The root cgroup is exempt from this restriction.  Root contains
processes and anonymous resource consumption which can't be associated
with any other cgroups and requires special treatment from most
controllers.  How resource consumption in the root cgroup is governed
is up to each controller (for more information on this topic please
refer to the Non-normative information section in the Controllers
chapter).

Note that the restriction doesn't get in the way if there is no
enabled controller in the cgroup's "cgroup.subtree_control".  This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup.  To control resource distribution of a cgroup, the
cgroup must create children and transfer all its processes to the
children before enabling controllers in its "cgroup.subtree_control"
file.

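The sequence described in the last paragraph (create a child, move the
processes down, then enable controllers) might look like the sketch
below.  The function names and the child name "leaf" are hypothetical,
and the sketch takes a single snapshot of "cgroup.procs"; a real tool
would have to cope with processes forking or exiting while the
migration is in progress.

```python
import os


def write_once(path, text):
    """Send text to an interface file as a single write(2) call."""
    with open(path, "wb", buffering=0) as f:
        f.write(text.encode())


def push_down_and_enable(cgroup_dir, controllers, leaf_name="leaf"):
    """Move a cgroup's processes into a child, then enable controllers."""
    leaf_dir = os.path.join(cgroup_dir, leaf_name)
    os.makedirs(leaf_dir, exist_ok=True)

    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [line.strip() for line in f if line.strip()]

    # Only one process can be migrated per write(2) call.
    for pid in pids:
        write_once(os.path.join(leaf_dir, "cgroup.procs"), f"{pid}\n")

    # With its processes moved out, the cgroup may enable controllers.
    request = " ".join(f"+{name}" for name in controllers)
    write_once(os.path.join(cgroup_dir, "cgroup.subtree_control"),
               f"{request}\n")
    return pids
```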
47762306a36Sopenharmony_ci
47862306a36Sopenharmony_ciDelegation
47962306a36Sopenharmony_ci----------
48062306a36Sopenharmony_ci
48162306a36Sopenharmony_ciModel of Delegation
48262306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~
48362306a36Sopenharmony_ci
A cgroup can be delegated in two ways.  First, to a less privileged
user by granting the user write access to the directory and its
"cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" files.
Second, if the "nsdelegate" mount option is set, automatically to a
cgroup namespace on namespace creation.
48962306a36Sopenharmony_ci
49062306a36Sopenharmony_ciBecause the resource control interface files in a given directory
49162306a36Sopenharmony_cicontrol the distribution of the parent's resources, the delegatee
49262306a36Sopenharmony_cishouldn't be allowed to write to them.  For the first method, this is
49362306a36Sopenharmony_ciachieved by not granting access to these files.  For the second, the
49462306a36Sopenharmony_cikernel rejects writes to all files other than "cgroup.procs" and
49562306a36Sopenharmony_ci"cgroup.subtree_control" on a namespace root from inside the
49662306a36Sopenharmony_cinamespace.
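
The first method can be sketched with plain ownership changes.  The
function below is a hypothetical helper, not an interface provided by
cgroup; it assumes that write access is granted through file ownership
and it deliberately leaves every resource control file (for example
"memory.max") owned by the delegator::

  # Usage: delegate_cgroup CGROUP_DIR USER
  delegate_cgroup() {
      cg=$1
      user=$2

      # Hand over the directory itself so the user can create children.
      chown "$user" "$cg" || return 1

      # Hand over only the files used to organize the sub-hierarchy.
      for f in cgroup.procs cgroup.threads cgroup.subtree_control; do
          chown "$user" "$cg/$f" || return 1
      done
  }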
49762306a36Sopenharmony_ci
The end results are equivalent for both delegation types.  Once
delegated, the user can build a sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent.  The limits and other settings
of all resource controllers are hierarchical and, regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.
50562306a36Sopenharmony_ci
50662306a36Sopenharmony_ciCurrently, cgroup doesn't impose any restrictions on the number of
50762306a36Sopenharmony_cicgroups in or nesting depth of a delegated sub-hierarchy; however,
50862306a36Sopenharmony_cithis may be limited explicitly in the future.
50962306a36Sopenharmony_ci
51062306a36Sopenharmony_ci
51162306a36Sopenharmony_ciDelegation Containment
51262306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~
51362306a36Sopenharmony_ci
51462306a36Sopenharmony_ciA delegated sub-hierarchy is contained in the sense that processes
51562306a36Sopenharmony_cican't be moved into or out of the sub-hierarchy by the delegatee.
51662306a36Sopenharmony_ci
51762306a36Sopenharmony_ciFor delegations to a less privileged user, this is achieved by
51862306a36Sopenharmony_cirequiring the following conditions for a process with a non-root euid
51962306a36Sopenharmony_cito migrate a target process into a cgroup by writing its PID to the
52062306a36Sopenharmony_ci"cgroup.procs" file.
52162306a36Sopenharmony_ci
52262306a36Sopenharmony_ci- The writer must have write access to the "cgroup.procs" file.
52362306a36Sopenharmony_ci
52462306a36Sopenharmony_ci- The writer must have write access to the "cgroup.procs" file of the
52562306a36Sopenharmony_ci  common ancestor of the source and destination cgroups.
52662306a36Sopenharmony_ci
52762306a36Sopenharmony_ciThe above two constraints ensure that while a delegatee may migrate
52862306a36Sopenharmony_ciprocesses around freely in the delegated sub-hierarchy it can't pull
52962306a36Sopenharmony_ciin from or push out to outside the sub-hierarchy.
53062306a36Sopenharmony_ci
As an example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::
53462306a36Sopenharmony_ci
53562306a36Sopenharmony_ci  ~~~~~~~~~~~~~ - C0 - C00
53662306a36Sopenharmony_ci  ~ cgroup    ~      \ C01
53762306a36Sopenharmony_ci  ~ hierarchy ~
53862306a36Sopenharmony_ci  ~~~~~~~~~~~~~ - C1 - C10
53962306a36Sopenharmony_ci
Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs".  U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation.  U0 does
not have write access to its "cgroup.procs" file, so the write is
denied with -EACCES.
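
The two conditions can be approximated from userspace as follows.  This
is only a sketch for reasoning about the rule: the function names are
hypothetical, the paths are assumed to be absolute, and the real check
is performed by the kernel at write time::

  # Print the closest common ancestor of two absolute cgroup paths.
  common_ancestor() {
      a=$1
      b=$2
      while [ "$a" != "/" ] && [ "$a" != "." ]; do
          case "$b/" in
              "$a"/*) break ;;
          esac
          a=$(dirname "$a")
      done
      printf '%s\n' "$a"
  }

  # Succeed only if the caller can write to the destination's
  # "cgroup.procs" and to that of the common ancestor.
  may_migrate() {
      src=$1
      dst=$2
      [ -w "$dst/cgroup.procs" ] &&
          [ -w "$(common_ancestor "$src" "$dst")/cgroup.procs" ]
  }

With the hierarchy from the example, a move from C00 to C01 resolves to
the common ancestor C0, which U0 can write to, while a move from C10 to
C00 resolves to a cgroup above both delegation points.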
54662306a36Sopenharmony_ci
54762306a36Sopenharmony_ciFor delegations to namespaces, containment is achieved by requiring
54862306a36Sopenharmony_cithat both the source and destination cgroups are reachable from the
54962306a36Sopenharmony_cinamespace of the process which is attempting the migration.  If either
55062306a36Sopenharmony_ciis not reachable, the migration is rejected with -ENOENT.
55162306a36Sopenharmony_ci
55262306a36Sopenharmony_ci
55362306a36Sopenharmony_ciGuidelines
55462306a36Sopenharmony_ci----------
55562306a36Sopenharmony_ci
55662306a36Sopenharmony_ciOrganize Once and Control
55762306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~
55862306a36Sopenharmony_ci
55962306a36Sopenharmony_ciMigrating a process across cgroups is a relatively expensive operation
56062306a36Sopenharmony_ciand stateful resources such as memory are not moved together with the
56162306a36Sopenharmony_ciprocess.  This is an explicit design decision as there often exist
56262306a36Sopenharmony_ciinherent trade-offs between migration and various hot paths in terms
56362306a36Sopenharmony_ciof synchronization cost.
56462306a36Sopenharmony_ci
56562306a36Sopenharmony_ciAs such, migrating processes across cgroups frequently as a means to
56662306a36Sopenharmony_ciapply different resource restrictions is discouraged.  A workload
56762306a36Sopenharmony_cishould be assigned to a cgroup according to the system's logical and
56862306a36Sopenharmony_ciresource structure once on start-up.  Dynamic adjustments to resource
56962306a36Sopenharmony_cidistribution can be made by changing controller configuration through
57062306a36Sopenharmony_cithe interface files.
57162306a36Sopenharmony_ci
57262306a36Sopenharmony_ci
57362306a36Sopenharmony_ciAvoid Name Collisions
57462306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~
57562306a36Sopenharmony_ci
57662306a36Sopenharmony_ciInterface files for a cgroup and its children cgroups occupy the same
57762306a36Sopenharmony_cidirectory and it is possible to create children cgroups which collide
57862306a36Sopenharmony_ciwith interface files.
57962306a36Sopenharmony_ci
All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot.  A controller's name is composed of lowercase letters and
'_'s but never begins with an '_', so '_' can be used as the prefix
character for collision avoidance.  Also, interface file names won't
start or end with terms which are often used in categorizing workloads
such as job, service, slice, unit or workload.
58762306a36Sopenharmony_ci
58862306a36Sopenharmony_cicgroup doesn't do anything to prevent name collisions and it's the
58962306a36Sopenharmony_ciuser's responsibility to avoid them.
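
A conservative check derived from the naming rules above could look
like the following.  The function name is hypothetical and the check is
deliberately pessimistic: it flags any name whose part before the first
dot could be a controller name, and it does not model the guarantee
about workload category terms::

  # Succeed (return 0) if NAME could collide with an interface file.
  cgroup_name_may_collide() {
      name=$1
      case "$name" in
          _*) return 1 ;;   # controller names never begin with '_'
          *.*) ;;           # has a "prefix." part, inspected below
          *) return 1 ;;    # no dot, so no interface file prefix
      esac
      prefix=${name%%.*}
      case "$prefix" in
          ''|*[!abcdefghijklmnopqrstuvwxyz_]*) return 1 ;;
      esac
      return 0
  }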
59062306a36Sopenharmony_ci
59162306a36Sopenharmony_ci
59262306a36Sopenharmony_ciResource Distribution Models
59362306a36Sopenharmony_ci============================
59462306a36Sopenharmony_ci
59562306a36Sopenharmony_cicgroup controllers implement several resource distribution schemes
59662306a36Sopenharmony_cidepending on the resource type and expected use cases.  This section
59762306a36Sopenharmony_cidescribes major schemes in use along with their expected behaviors.
59862306a36Sopenharmony_ci
59962306a36Sopenharmony_ci
60062306a36Sopenharmony_ciWeights
60162306a36Sopenharmony_ci-------
60262306a36Sopenharmony_ci
60362306a36Sopenharmony_ciA parent's resource is distributed by adding up the weights of all
60462306a36Sopenharmony_ciactive children and giving each the fraction matching the ratio of its
60562306a36Sopenharmony_ciweight against the sum.  As only children which can make use of the
60662306a36Sopenharmony_ciresource at the moment participate in the distribution, this is
60762306a36Sopenharmony_ciwork-conserving.  Due to the dynamic nature, this model is usually
60862306a36Sopenharmony_ciused for stateless resources.
60962306a36Sopenharmony_ci
61062306a36Sopenharmony_ciAll weights are in the range [1, 10000] with the default at 100.  This
61162306a36Sopenharmony_ciallows symmetric multiplicative biases in both directions at fine
61262306a36Sopenharmony_cienough granularity while staying in the intuitive range.
61362306a36Sopenharmony_ci
61462306a36Sopenharmony_ciAs long as the weight is in range, all configuration combinations are
61562306a36Sopenharmony_civalid and there is no reason to reject configuration changes or
61662306a36Sopenharmony_ciprocess migrations.
61762306a36Sopenharmony_ci
61862306a36Sopenharmony_ci"cpu.weight" proportionally distributes CPU cycles to active children
61962306a36Sopenharmony_ciand is an example of this type.
62062306a36Sopenharmony_ci
62162306a36Sopenharmony_ci
62262306a36Sopenharmony_ci.. _cgroupv2-limits-distributor:
62362306a36Sopenharmony_ci
62462306a36Sopenharmony_ciLimits
62562306a36Sopenharmony_ci------
62662306a36Sopenharmony_ci
62762306a36Sopenharmony_ciA child can only consume up to the configured amount of the resource.
62862306a36Sopenharmony_ciLimits can be over-committed - the sum of the limits of children can
62962306a36Sopenharmony_ciexceed the amount of resource available to the parent.
63062306a36Sopenharmony_ci
Limits are in the range [0, max] and default to "max", which is a noop.
63262306a36Sopenharmony_ci
63362306a36Sopenharmony_ciAs limits can be over-committed, all configuration combinations are
63462306a36Sopenharmony_civalid and there is no reason to reject configuration changes or
63562306a36Sopenharmony_ciprocess migrations.
63662306a36Sopenharmony_ci
63762306a36Sopenharmony_ci"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
63862306a36Sopenharmony_cion an IO device and is an example of this type.
63962306a36Sopenharmony_ci
64062306a36Sopenharmony_ci.. _cgroupv2-protections-distributor:
64162306a36Sopenharmony_ci
64262306a36Sopenharmony_ciProtections
64362306a36Sopenharmony_ci-----------
64462306a36Sopenharmony_ci
64562306a36Sopenharmony_ciA cgroup is protected up to the configured amount of the resource
64662306a36Sopenharmony_cias long as the usages of all its ancestors are under their
64762306a36Sopenharmony_ciprotected levels.  Protections can be hard guarantees or best effort
64862306a36Sopenharmony_cisoft boundaries.  Protections can also be over-committed in which case
64962306a36Sopenharmony_cionly up to the amount available to the parent is protected among
65062306a36Sopenharmony_cichildren.
65162306a36Sopenharmony_ci
Protections are in the range [0, max] and default to 0, which is a
noop.
65462306a36Sopenharmony_ci
65562306a36Sopenharmony_ciAs protections can be over-committed, all configuration combinations
65662306a36Sopenharmony_ciare valid and there is no reason to reject configuration changes or
65762306a36Sopenharmony_ciprocess migrations.
65862306a36Sopenharmony_ci
65962306a36Sopenharmony_ci"memory.low" implements best-effort memory protection and is an
66062306a36Sopenharmony_ciexample of this type.
66162306a36Sopenharmony_ci
66262306a36Sopenharmony_ci
66362306a36Sopenharmony_ciAllocations
66462306a36Sopenharmony_ci-----------
66562306a36Sopenharmony_ci
A cgroup is exclusively allocated a certain amount of a finite
resource.  Allocations can't be over-committed - the sum of the
allocations of children cannot exceed the amount of resource
available to the parent.

Allocations are in the range [0, max] and default to 0, which is no
resource.
67362306a36Sopenharmony_ci
67462306a36Sopenharmony_ciAs allocations can't be over-committed, some configuration
67562306a36Sopenharmony_cicombinations are invalid and should be rejected.  Also, if the
67662306a36Sopenharmony_ciresource is mandatory for execution of processes, process migrations
67762306a36Sopenharmony_cimay be rejected.
67862306a36Sopenharmony_ci
67962306a36Sopenharmony_ci"cpu.rt.max" hard-allocates realtime slices and is an example of this
68062306a36Sopenharmony_citype.
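
The validity rule for allocations amounts to a simple sum check, which
limits and protections do not need because they can be over-committed.
The sketch below uses a hypothetical function name and plain integer
amounts::

  # Usage: allocations_fit PARENT_AMOUNT CHILD_ALLOCATION...
  allocations_fit() {
      available=$1
      shift
      sum=0
      for a in "$@"; do
          sum=$((sum + a))
      done
      # Valid only if the children do not exceed the parent's amount.
      [ "$sum" -le "$available" ]
  }

For a parent holding 100 units, "allocations_fit 100 40 60" succeeds
while "allocations_fit 100 40 70" fails and represents a configuration
that should be rejected.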
68162306a36Sopenharmony_ci
68262306a36Sopenharmony_ci
68362306a36Sopenharmony_ciInterface Files
68462306a36Sopenharmony_ci===============
68562306a36Sopenharmony_ci
68662306a36Sopenharmony_ciFormat
68762306a36Sopenharmony_ci------
68862306a36Sopenharmony_ci
68962306a36Sopenharmony_ciAll interface files should be in one of the following formats whenever
69062306a36Sopenharmony_cipossible::
69162306a36Sopenharmony_ci
69262306a36Sopenharmony_ci  New-line separated values
69362306a36Sopenharmony_ci  (when only one value can be written at once)
69462306a36Sopenharmony_ci
69562306a36Sopenharmony_ci	VAL0\n
69662306a36Sopenharmony_ci	VAL1\n
69762306a36Sopenharmony_ci	...
69862306a36Sopenharmony_ci
69962306a36Sopenharmony_ci  Space separated values
70062306a36Sopenharmony_ci  (when read-only or multiple values can be written at once)
70162306a36Sopenharmony_ci
70262306a36Sopenharmony_ci	VAL0 VAL1 ...\n
70362306a36Sopenharmony_ci
70462306a36Sopenharmony_ci  Flat keyed
70562306a36Sopenharmony_ci
70662306a36Sopenharmony_ci	KEY0 VAL0\n
70762306a36Sopenharmony_ci	KEY1 VAL1\n
70862306a36Sopenharmony_ci	...
70962306a36Sopenharmony_ci
71062306a36Sopenharmony_ci  Nested keyed
71162306a36Sopenharmony_ci
71262306a36Sopenharmony_ci	KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
71362306a36Sopenharmony_ci	KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
71462306a36Sopenharmony_ci	...
71562306a36Sopenharmony_ci
71662306a36Sopenharmony_ciFor a writable file, the format for writing should generally match
71762306a36Sopenharmony_cireading; however, controllers may allow omitting later fields or
71862306a36Sopenharmony_ciimplement restricted shortcuts for most common use cases.
71962306a36Sopenharmony_ci
72062306a36Sopenharmony_ciFor both flat and nested keyed files, only the values for a single key
72162306a36Sopenharmony_cican be written at a time.  For nested keyed files, the sub key pairs
72262306a36Sopenharmony_cimay be specified in any order and not all pairs have to be specified.
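
The keyed formats are straightforward to consume with standard text
tools.  The following is a minimal sketch using awk; the function names
are hypothetical and the sample input in the usage note is made up::

  # Usage: flat_keyed_get KEY < FILE
  flat_keyed_get() {
      awk -v key="$1" '
          ($1 "") == (key "") { print $2; found = 1; exit }
          END { exit !found }'
  }

  # Usage: nested_keyed_get KEY SUB_KEY < FILE
  nested_keyed_get() {
      awk -v key="$1" -v sub_key="$2" '
          ($1 "") == (key "") {
              # Sub key pairs may appear in any order.
              for (i = 2; i <= NF; i++) {
                  eq = index($i, "=")
                  if (eq > 0 && substr($i, 1, eq - 1) == sub_key) {
                      print substr($i, eq + 1)
                      found = 1
                      exit
                  }
              }
          }
          END { exit !found }'
  }

Both functions read from standard input and exit with a non-zero status
when the key is missing.  For instance, feeding the two lines
"populated 1" and "frozen 0" to "flat_keyed_get frozen" prints 0, and
feeding "8:16 rbps=2097152 wbps=max" to "nested_keyed_get 8:16 rbps"
prints 2097152.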
72362306a36Sopenharmony_ci
72462306a36Sopenharmony_ci
72562306a36Sopenharmony_ciConventions
72662306a36Sopenharmony_ci-----------
72762306a36Sopenharmony_ci
72862306a36Sopenharmony_ci- Settings for a single feature should be contained in a single file.
72962306a36Sopenharmony_ci
73062306a36Sopenharmony_ci- The root cgroup should be exempt from resource control and thus
73162306a36Sopenharmony_ci  shouldn't have resource control interface files.
73262306a36Sopenharmony_ci
73362306a36Sopenharmony_ci- The default time unit is microseconds.  If a different unit is ever
73462306a36Sopenharmony_ci  used, an explicit unit suffix must be present.
73562306a36Sopenharmony_ci
- A parts-per quantity should use a percentage decimal with a
  fractional part of at least two digits - e.g. 13.40.
73862306a36Sopenharmony_ci
73962306a36Sopenharmony_ci- If a controller implements weight based resource distribution, its
74062306a36Sopenharmony_ci  interface file should be named "weight" and have the range [1,
74162306a36Sopenharmony_ci  10000] with 100 as the default.  The values are chosen to allow
74262306a36Sopenharmony_ci  enough and symmetric bias in both directions while keeping it
74362306a36Sopenharmony_ci  intuitive (the default is 100%).
74462306a36Sopenharmony_ci
74562306a36Sopenharmony_ci- If a controller implements an absolute resource guarantee and/or
74662306a36Sopenharmony_ci  limit, the interface files should be named "min" and "max"
74762306a36Sopenharmony_ci  respectively.  If a controller implements best effort resource
74862306a36Sopenharmony_ci  guarantee and/or limit, the interface files should be named "low"
74962306a36Sopenharmony_ci  and "high" respectively.
75062306a36Sopenharmony_ci
75162306a36Sopenharmony_ci  In the above four control files, the special token "max" should be
75262306a36Sopenharmony_ci  used to represent upward infinity for both reading and writing.
75362306a36Sopenharmony_ci
75462306a36Sopenharmony_ci- If a setting has a configurable default value and keyed specific
75562306a36Sopenharmony_ci  overrides, the default entry should be keyed with "default" and
75662306a36Sopenharmony_ci  appear as the first entry in the file.
75762306a36Sopenharmony_ci
75862306a36Sopenharmony_ci  The default value can be updated by writing either "default $VAL" or
75962306a36Sopenharmony_ci  "$VAL".
76062306a36Sopenharmony_ci
76162306a36Sopenharmony_ci  When writing to update a specific override, "default" can be used as
76262306a36Sopenharmony_ci  the value to indicate removal of the override.  Override entries
76362306a36Sopenharmony_ci  with "default" as the value must not appear when read.
76462306a36Sopenharmony_ci
76562306a36Sopenharmony_ci  For example, a setting which is keyed by major:minor device numbers
76662306a36Sopenharmony_ci  with integer values may look like the following::
76762306a36Sopenharmony_ci
76862306a36Sopenharmony_ci    # cat cgroup-example-interface-file
76962306a36Sopenharmony_ci    default 150
77062306a36Sopenharmony_ci    8:0 300
77162306a36Sopenharmony_ci
77262306a36Sopenharmony_ci  The default value can be updated by::
77362306a36Sopenharmony_ci
77462306a36Sopenharmony_ci    # echo 125 > cgroup-example-interface-file
77562306a36Sopenharmony_ci
77662306a36Sopenharmony_ci  or::
77762306a36Sopenharmony_ci
77862306a36Sopenharmony_ci    # echo "default 125" > cgroup-example-interface-file
77962306a36Sopenharmony_ci
78062306a36Sopenharmony_ci  An override can be set by::
78162306a36Sopenharmony_ci
78262306a36Sopenharmony_ci    # echo "8:16 170" > cgroup-example-interface-file
78362306a36Sopenharmony_ci
78462306a36Sopenharmony_ci  and cleared by::
78562306a36Sopenharmony_ci
78662306a36Sopenharmony_ci    # echo "8:0 default" > cgroup-example-interface-file
78762306a36Sopenharmony_ci    # cat cgroup-example-interface-file
78862306a36Sopenharmony_ci    default 125
78962306a36Sopenharmony_ci    8:16 170
79062306a36Sopenharmony_ci
- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
  Whenever a notifiable event happens, a file modified event should be
  generated on the file.
79562306a36Sopenharmony_ci
79662306a36Sopenharmony_ci
79762306a36Sopenharmony_ciCore Interface Files
79862306a36Sopenharmony_ci--------------------
79962306a36Sopenharmony_ci
80062306a36Sopenharmony_ciAll cgroup core files are prefixed with "cgroup."
80162306a36Sopenharmony_ci
80262306a36Sopenharmony_ci  cgroup.type
80362306a36Sopenharmony_ci	A read-write single value file which exists on non-root
80462306a36Sopenharmony_ci	cgroups.
80562306a36Sopenharmony_ci
80662306a36Sopenharmony_ci	When read, it indicates the current type of the cgroup, which
80762306a36Sopenharmony_ci	can be one of the following values.
80862306a36Sopenharmony_ci
80962306a36Sopenharmony_ci	- "domain" : A normal valid domain cgroup.
81062306a36Sopenharmony_ci
81162306a36Sopenharmony_ci	- "domain threaded" : A threaded domain cgroup which is
81262306a36Sopenharmony_ci          serving as the root of a threaded subtree.
81362306a36Sopenharmony_ci
81462306a36Sopenharmony_ci	- "domain invalid" : A cgroup which is in an invalid state.
81562306a36Sopenharmony_ci	  It can't be populated or have controllers enabled.  It may
81662306a36Sopenharmony_ci	  be allowed to become a threaded cgroup.
81762306a36Sopenharmony_ci
81862306a36Sopenharmony_ci	- "threaded" : A threaded cgroup which is a member of a
81962306a36Sopenharmony_ci          threaded subtree.
82062306a36Sopenharmony_ci
82162306a36Sopenharmony_ci	A cgroup can be turned into a threaded cgroup by writing
82262306a36Sopenharmony_ci	"threaded" to this file.
82362306a36Sopenharmony_ci
82462306a36Sopenharmony_ci  cgroup.procs
82562306a36Sopenharmony_ci	A read-write new-line separated values file which exists on
82662306a36Sopenharmony_ci	all cgroups.
82762306a36Sopenharmony_ci
82862306a36Sopenharmony_ci	When read, it lists the PIDs of all processes which belong to
82962306a36Sopenharmony_ci	the cgroup one-per-line.  The PIDs are not ordered and the
83062306a36Sopenharmony_ci	same PID may show up more than once if the process got moved
83162306a36Sopenharmony_ci	to another cgroup and then back or the PID got recycled while
83262306a36Sopenharmony_ci	reading.
83362306a36Sopenharmony_ci
83462306a36Sopenharmony_ci	A PID can be written to migrate the process associated with
83562306a36Sopenharmony_ci	the PID to the cgroup.  The writer should match all of the
83662306a36Sopenharmony_ci	following conditions.
83762306a36Sopenharmony_ci
83862306a36Sopenharmony_ci	- It must have write access to the "cgroup.procs" file.
83962306a36Sopenharmony_ci
84062306a36Sopenharmony_ci	- It must have write access to the "cgroup.procs" file of the
84162306a36Sopenharmony_ci	  common ancestor of the source and destination cgroups.
84262306a36Sopenharmony_ci
84362306a36Sopenharmony_ci	When delegating a sub-hierarchy, write access to this file
84462306a36Sopenharmony_ci	should be granted along with the containing directory.
84562306a36Sopenharmony_ci
84662306a36Sopenharmony_ci	In a threaded cgroup, reading this file fails with EOPNOTSUPP
84762306a36Sopenharmony_ci	as all the processes belong to the thread root.  Writing is
84862306a36Sopenharmony_ci	supported and moves every thread of the process to the cgroup.
84962306a36Sopenharmony_ci
85062306a36Sopenharmony_ci  cgroup.threads
85162306a36Sopenharmony_ci	A read-write new-line separated values file which exists on
85262306a36Sopenharmony_ci	all cgroups.
85362306a36Sopenharmony_ci
85462306a36Sopenharmony_ci	When read, it lists the TIDs of all threads which belong to
85562306a36Sopenharmony_ci	the cgroup one-per-line.  The TIDs are not ordered and the
85662306a36Sopenharmony_ci	same TID may show up more than once if the thread got moved to
85762306a36Sopenharmony_ci	another cgroup and then back or the TID got recycled while
85862306a36Sopenharmony_ci	reading.
85962306a36Sopenharmony_ci
86062306a36Sopenharmony_ci	A TID can be written to migrate the thread associated with the
86162306a36Sopenharmony_ci	TID to the cgroup.  The writer should match all of the
86262306a36Sopenharmony_ci	following conditions.
86362306a36Sopenharmony_ci
86462306a36Sopenharmony_ci	- It must have write access to the "cgroup.threads" file.
86562306a36Sopenharmony_ci
86662306a36Sopenharmony_ci	- The cgroup that the thread is currently in must be in the
86762306a36Sopenharmony_ci          same resource domain as the destination cgroup.
86862306a36Sopenharmony_ci
86962306a36Sopenharmony_ci	- It must have write access to the "cgroup.procs" file of the
87062306a36Sopenharmony_ci	  common ancestor of the source and destination cgroups.
87162306a36Sopenharmony_ci
87262306a36Sopenharmony_ci	When delegating a sub-hierarchy, write access to this file
87362306a36Sopenharmony_ci	should be granted along with the containing directory.
87462306a36Sopenharmony_ci
87562306a36Sopenharmony_ci  cgroup.controllers
87662306a36Sopenharmony_ci	A read-only space separated values file which exists on all
87762306a36Sopenharmony_ci	cgroups.
87862306a36Sopenharmony_ci
	It shows a space separated list of all controllers available to
	the cgroup.  The controllers are not ordered.
88162306a36Sopenharmony_ci
88262306a36Sopenharmony_ci  cgroup.subtree_control
88362306a36Sopenharmony_ci	A read-write space separated values file which exists on all
88462306a36Sopenharmony_ci	cgroups.  Starts out empty.
88562306a36Sopenharmony_ci
	When read, it shows a space separated list of the controllers
	which are enabled to control resource distribution from the
	cgroup to its children.

	A space separated list of controllers prefixed with '+' or '-'
	can be written to enable or disable controllers.  A controller
	name prefixed with '+' enables the controller and '-'
	disables.  If a controller appears more than once on the list,
	the last one is effective.  When multiple enable and disable
	operations are specified, either all succeed or all fail.
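
	As a sketch of how the two files fit together, the hypothetical
	helper below enables a set of controllers after checking that
	each one is listed in "cgroup.controllers".  The whole request
	is issued as a single write so that it succeeds or fails as a
	unit::

	  # Usage: enable_controllers CGROUP_DIR CTRL...
	  enable_controllers() {
	      cg=$1
	      shift
	      available=" $(cat "$cg/cgroup.controllers") " || return 1
	      request=
	      for c in "$@"; do
	          case "$available" in
	              *" $c "*) request="$request +$c" ;;
	              *) echo "controller not available: $c" >&2
	                 return 1 ;;
	          esac
	      done
	      # Strip the leading space and write the request in one go.
	      printf '%s\n' "${request# }" > "$cg/cgroup.subtree_control"
	  }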
89662306a36Sopenharmony_ci
89762306a36Sopenharmony_ci  cgroup.events
89862306a36Sopenharmony_ci	A read-only flat-keyed file which exists on non-root cgroups.
89962306a36Sopenharmony_ci	The following entries are defined.  Unless specified
90062306a36Sopenharmony_ci	otherwise, a value change in this file generates a file
90162306a36Sopenharmony_ci	modified event.
90262306a36Sopenharmony_ci
90362306a36Sopenharmony_ci	  populated
		1 if the cgroup or its descendants contain any live
		processes; otherwise, 0.
90662306a36Sopenharmony_ci	  frozen
90762306a36Sopenharmony_ci		1 if the cgroup is frozen; otherwise, 0.
90862306a36Sopenharmony_ci
90962306a36Sopenharmony_ci  cgroup.max.descendants
	A read-write single value file.  The default is "max".

	Maximum allowed number of descendant cgroups.
	If the actual number of descendants is equal or larger,
	an attempt to create a new cgroup in the hierarchy will fail.
91562306a36Sopenharmony_ci
91662306a36Sopenharmony_ci  cgroup.max.depth
	A read-write single value file.  The default is "max".

	Maximum allowed descent depth below the current cgroup.
	If the actual descent depth is equal or larger,
	an attempt to create a new child cgroup will fail.
92262306a36Sopenharmony_ci
92362306a36Sopenharmony_ci  cgroup.stat
92462306a36Sopenharmony_ci	A read-only flat-keyed file with the following entries:
92562306a36Sopenharmony_ci
92662306a36Sopenharmony_ci	  nr_descendants
92762306a36Sopenharmony_ci		Total number of visible descendant cgroups.
92862306a36Sopenharmony_ci
92962306a36Sopenharmony_ci	  nr_dying_descendants
		Total number of dying descendant cgroups. A cgroup becomes
		dying after being deleted by a user. The cgroup will remain
		in dying state for some undefined time (which can depend
		on system load) before being completely destroyed.

		A process can't enter a dying cgroup under any circumstances,
		and a dying cgroup can't revive.
93762306a36Sopenharmony_ci
93862306a36Sopenharmony_ci		A dying cgroup can consume system resources not exceeding
93962306a36Sopenharmony_ci		limits, which were active at the moment of cgroup deletion.
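
	The "nr_descendants" entry can be compared against
	"cgroup.max.descendants" to estimate whether another cgroup can
	still be created.  The helper below is a hypothetical sketch
	which looks at a single cgroup only and assumes that
	"nr_descendants" is the count the limit is checked against;
	limits configured on ancestors are not considered::

	  # Usage: below_descendant_limit CGROUP_DIR
	  below_descendant_limit() {
	      cg=$1
	      max=$(cat "$cg/cgroup.max.descendants") || return 1
	      # "max" means no limit is configured.
	      [ "$max" = "max" ] && return 0
	      nr=$(awk '$1 == "nr_descendants" { print $2 }' \
	          "$cg/cgroup.stat") || return 1
	      [ "$nr" -lt "$max" ]
	  }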
94062306a36Sopenharmony_ci
94162306a36Sopenharmony_ci  cgroup.freeze
94262306a36Sopenharmony_ci	A read-write single value file which exists on non-root cgroups.
94362306a36Sopenharmony_ci	Allowed values are "0" and "1". The default is "0".
94462306a36Sopenharmony_ci
	Writing "1" to the file causes freezing of the cgroup and all
	descendant cgroups. This means that all processes belonging to
	them will be stopped and will not run until the cgroup is explicitly
	unfrozen. Freezing of the cgroup may take some time; when this action
	is completed, the "frozen" value in the cgroup.events control file
	will be updated to "1" and the corresponding notification will be
	issued.

	A cgroup can be frozen either by its own settings, or by settings
	of any ancestor cgroups. If any ancestor cgroup is frozen, the
	cgroup will remain frozen.
95662306a36Sopenharmony_ci
95762306a36Sopenharmony_ci	Processes in the frozen cgroup can be killed by a fatal signal.
95862306a36Sopenharmony_ci	They also can enter and leave a frozen cgroup: either by an explicit
95962306a36Sopenharmony_ci	move by a user, or if freezing of the cgroup races with fork().
96062306a36Sopenharmony_ci	If a process is moved to a frozen cgroup, it stops. If a process is
96162306a36Sopenharmony_ci	moved out of a frozen cgroup, it becomes running.
96262306a36Sopenharmony_ci
96362306a36Sopenharmony_ci	Frozen status of a cgroup doesn't affect any cgroup tree operations:
96462306a36Sopenharmony_ci	it's possible to delete a frozen (and empty) cgroup, as well as
96562306a36Sopenharmony_ci	create new sub-cgroups.
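
	Because freezing is asynchronous, a caller usually has to wait
	for the "frozen" entry to change.  The sketch below uses a
	hypothetical function name and, for simplicity, polls
	"cgroup.events" once per second instead of waiting for the
	file modified notification::

	  # Usage: freeze_cgroup CGROUP_DIR [MAX_TRIES]
	  freeze_cgroup() {
	      cg=$1
	      tries=${2:-10}
	      echo 1 > "$cg/cgroup.freeze" || return 1
	      while [ "$tries" -gt 0 ]; do
	          # Done once the kernel reports the cgroup as frozen.
	          if grep -q '^frozen 1$' "$cg/cgroup.events"; then
	              return 0
	          fi
	          tries=$((tries - 1))
	          sleep 1
	      done
	      return 1
	  }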
96662306a36Sopenharmony_ci
96762306a36Sopenharmony_ci  cgroup.kill
96862306a36Sopenharmony_ci	A write-only single value file which exists in non-root cgroups.
96962306a36Sopenharmony_ci	The only allowed value is "1".
97062306a36Sopenharmony_ci
97162306a36Sopenharmony_ci	Writing "1" to the file causes the cgroup and all descendant cgroups to
97262306a36Sopenharmony_ci	be killed. This means that all processes located in the affected cgroup
97362306a36Sopenharmony_ci	tree will be killed via SIGKILL.
97462306a36Sopenharmony_ci
97562306a36Sopenharmony_ci	Killing a cgroup tree will deal with concurrent forks appropriately and
97662306a36Sopenharmony_ci	is protected against migrations.
97762306a36Sopenharmony_ci
97862306a36Sopenharmony_ci	In a threaded cgroup, writing this file fails with EOPNOTSUPP as
97962306a36Sopenharmony_ci	killing cgroups is a process directed operation, i.e. it affects
98062306a36Sopenharmony_ci	the whole thread-group.
98162306a36Sopenharmony_ci
98262306a36Sopenharmony_ci  cgroup.pressure
	A read-write single value file whose allowed values are "0" and "1".
	The default is "1".

	Writing "0" to the file will disable the cgroup PSI accounting.
	Writing "1" to the file will re-enable the cgroup PSI accounting.

	This control attribute is not hierarchical, so disabling or enabling
	PSI accounting in a cgroup does not affect PSI accounting in
	descendants and does not require enablement to be passed down via
	ancestors from root.
99262306a36Sopenharmony_ci
99362306a36Sopenharmony_ci	The reason this control attribute exists is that PSI accounts stalls for
99462306a36Sopenharmony_ci	each cgroup separately and aggregates it at each level of the hierarchy.
99562306a36Sopenharmony_ci	This may cause non-negligible overhead for some workloads when under
99662306a36Sopenharmony_ci	deep level of the hierarchy, in which case this control attribute can
99762306a36Sopenharmony_ci	be used to disable PSI accounting in the non-leaf cgroups.
99862306a36Sopenharmony_ci
99962306a36Sopenharmony_ci  irq.pressure
100062306a36Sopenharmony_ci	A read-write nested-keyed file.
100162306a36Sopenharmony_ci
100262306a36Sopenharmony_ci	Shows pressure stall information for IRQ/SOFTIRQ. See
100362306a36Sopenharmony_ci	:ref:`Documentation/accounting/psi.rst <psi>` for details.
100462306a36Sopenharmony_ci
100562306a36Sopenharmony_ciControllers
100662306a36Sopenharmony_ci===========
100762306a36Sopenharmony_ci
100862306a36Sopenharmony_ci.. _cgroup-v2-cpu:
100962306a36Sopenharmony_ci
101062306a36Sopenharmony_ciCPU
101162306a36Sopenharmony_ci---
101262306a36Sopenharmony_ci
The "cpu" controller regulates distribution of CPU cycles.  This
101462306a36Sopenharmony_cicontroller implements weight and absolute bandwidth limit models for
101562306a36Sopenharmony_cinormal scheduling policy and absolute bandwidth allocation model for
101662306a36Sopenharmony_cirealtime scheduling policy.
101762306a36Sopenharmony_ci
In all the above models, cycle distribution is defined only on a temporal
basis and does not account for the frequency at which tasks are executed.
The (optional) utilization clamping support allows hinting the schedutil
cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.
102462306a36Sopenharmony_ci
102562306a36Sopenharmony_ciWARNING: cgroup2 doesn't yet support control of realtime processes and
102662306a36Sopenharmony_cithe cpu controller can only be enabled when all RT processes are in
102762306a36Sopenharmony_cithe root cgroup.  Be aware that system management software may already
102862306a36Sopenharmony_cihave placed RT processes into nonroot cgroups during the system boot
102962306a36Sopenharmony_ciprocess, and these processes may need to be moved to the root cgroup
103062306a36Sopenharmony_cibefore the cpu controller can be enabled.
103162306a36Sopenharmony_ci
103262306a36Sopenharmony_ci
103362306a36Sopenharmony_ciCPU Interface Files
103462306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~
103562306a36Sopenharmony_ci
103662306a36Sopenharmony_ciAll time durations are in microseconds.
103762306a36Sopenharmony_ci
103862306a36Sopenharmony_ci  cpu.stat
103962306a36Sopenharmony_ci	A read-only flat-keyed file.
104062306a36Sopenharmony_ci	This file exists whether the controller is enabled or not.
104162306a36Sopenharmony_ci
104262306a36Sopenharmony_ci	It always reports the following three stats:
104362306a36Sopenharmony_ci
104462306a36Sopenharmony_ci	- usage_usec
104562306a36Sopenharmony_ci	- user_usec
104662306a36Sopenharmony_ci	- system_usec
104762306a36Sopenharmony_ci
104862306a36Sopenharmony_ci	and the following five when the controller is enabled:
104962306a36Sopenharmony_ci
105062306a36Sopenharmony_ci	- nr_periods
105162306a36Sopenharmony_ci	- nr_throttled
105262306a36Sopenharmony_ci	- throttled_usec
105362306a36Sopenharmony_ci	- nr_bursts
105462306a36Sopenharmony_ci	- burst_usec
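
The flat-keyed layout of cpu.stat ("KEY VALUE" per line) can be read with a
few lines of Python.  This is a minimal sketch: parse_flat_keyed is a
hypothetical helper, the sample numbers are made up, and reading
nr_throttled / nr_periods as a throttling ratio is an interpretation based on
the field names rather than something stated above.

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed cgroup file ("KEY VALUE" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        stats[key] = int(value)
    return stats


# Sample content; the numbers are invented for illustration only.
sample = """usage_usec 1200
user_usec 800
system_usec 400
nr_periods 10
nr_throttled 2
throttled_usec 150
"""

stats = parse_flat_keyed(sample)
# Fraction of periods in which the group was throttled (assumed meaning).
throttled_ratio = stats["nr_throttled"] / stats["nr_periods"]
```

Looking values up by key keeps the code working if the kernel adds further
entries to the file.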
105562306a36Sopenharmony_ci
105662306a36Sopenharmony_ci  cpu.weight
105762306a36Sopenharmony_ci	A read-write single value file which exists on non-root
105862306a36Sopenharmony_ci	cgroups.  The default is "100".
105962306a36Sopenharmony_ci
	The weight is in the range [1, 10000].
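
Under the weight model, active sibling cgroups share their parent's CPU time
in proportion to their weights.  The sketch below works through that
arithmetic; the function name cpu_shares and the cgroup names "web" and
"batch" are hypothetical, and the result only describes the ideal split among
siblings that are all competing for CPU.

```python
def cpu_shares(weights):
    """Return each cgroup's fraction of the parent's CPU time.

    weights maps a (hypothetical) sibling cgroup name to its cpu.weight.
    Only siblings that are actively competing should be included.
    """
    for name, weight in weights.items():
        if not 1 <= weight <= 10000:
            raise ValueError(f"{name}: weight {weight} outside [1, 10000]")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = cpu_shares({"web": 300, "batch": 100})
```

With these example weights, "web" is entitled to three times as much CPU time
as "batch" whenever both are busy.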
106162306a36Sopenharmony_ci
106262306a36Sopenharmony_ci  cpu.weight.nice
106362306a36Sopenharmony_ci	A read-write single value file which exists on non-root
106462306a36Sopenharmony_ci	cgroups.  The default is "0".
106562306a36Sopenharmony_ci
106662306a36Sopenharmony_ci	The nice value is in the range [-20, 19].
106762306a36Sopenharmony_ci
106862306a36Sopenharmony_ci	This interface file is an alternative interface for
106962306a36Sopenharmony_ci	"cpu.weight" and allows reading and setting weight using the
107062306a36Sopenharmony_ci	same values used by nice(2).  Because the range is smaller and
107162306a36Sopenharmony_ci	granularity is coarser for the nice values, the read value is
107262306a36Sopenharmony_ci	the closest approximation of the current weight.
107362306a36Sopenharmony_ci
107462306a36Sopenharmony_ci  cpu.max
107562306a36Sopenharmony_ci	A read-write two value file which exists on non-root cgroups.
107662306a36Sopenharmony_ci	The default is "max 100000".
107762306a36Sopenharmony_ci
107862306a36Sopenharmony_ci	The maximum bandwidth limit.  It's in the following format::
107962306a36Sopenharmony_ci
108062306a36Sopenharmony_ci	  $MAX $PERIOD
108162306a36Sopenharmony_ci
108262306a36Sopenharmony_ci	which indicates that the group may consume up to $MAX in each
108362306a36Sopenharmony_ci	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
108462306a36Sopenharmony_ci	one number is written, $MAX is updated.
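
To make the "$MAX $PERIOD" format concrete, the following Python sketch parses
a cpu.max value and expresses the limit as a number of CPUs.  The function
names are hypothetical, and converting the limit to "CPUs" by dividing $MAX by
$PERIOD is a convenience chosen for this example.

```python
def parse_cpu_max(text):
    """Parse "$MAX $PERIOD" and return (max_usec, period_usec).

    max_usec is None when $MAX is "max", i.e. no limit.
    """
    max_field, period_field = text.split()
    max_usec = None if max_field == "max" else int(max_field)
    return max_usec, int(period_field)


def cpu_limit_in_cpus(text):
    """Return the bandwidth limit expressed in CPUs, or None if unlimited."""
    max_usec, period_usec = parse_cpu_max(text)
    if max_usec is None:
        return None
    return max_usec / period_usec
```

For example, "50000 100000" allows 50ms of CPU time in every 100ms period,
i.e. half a CPU, while the default "max 100000" imposes no limit.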
108562306a36Sopenharmony_ci
108662306a36Sopenharmony_ci  cpu.max.burst
108762306a36Sopenharmony_ci	A read-write single value file which exists on non-root
108862306a36Sopenharmony_ci	cgroups.  The default is "0".
108962306a36Sopenharmony_ci
	The burst is in the range [0, $MAX].
109162306a36Sopenharmony_ci
109262306a36Sopenharmony_ci  cpu.pressure
109362306a36Sopenharmony_ci	A read-write nested-keyed file.
109462306a36Sopenharmony_ci
109562306a36Sopenharmony_ci	Shows pressure stall information for CPU. See
109662306a36Sopenharmony_ci	:ref:`Documentation/accounting/psi.rst <psi>` for details.
109762306a36Sopenharmony_ci
109862306a36Sopenharmony_ci  cpu.uclamp.min
109962306a36Sopenharmony_ci        A read-write single value file which exists on non-root cgroups.
110062306a36Sopenharmony_ci        The default is "0", i.e. no utilization boosting.
110162306a36Sopenharmony_ci
110262306a36Sopenharmony_ci        The requested minimum utilization (protection) as a percentage
110362306a36Sopenharmony_ci        rational number, e.g. 12.34 for 12.34%.
110462306a36Sopenharmony_ci
110562306a36Sopenharmony_ci        This interface allows reading and setting minimum utilization clamp
        values similar to sched_setattr(2). This minimum utilization
110762306a36Sopenharmony_ci        value is used to clamp the task specific minimum utilization clamp.
110862306a36Sopenharmony_ci
110962306a36Sopenharmony_ci        The requested minimum utilization (protection) is always capped by
111062306a36Sopenharmony_ci        the current value for the maximum utilization (limit), i.e.
111162306a36Sopenharmony_ci        `cpu.uclamp.max`.
111262306a36Sopenharmony_ci
111362306a36Sopenharmony_ci  cpu.uclamp.max
111462306a36Sopenharmony_ci        A read-write single value file which exists on non-root cgroups.
        The default is "max", i.e. no utilization capping.
111662306a36Sopenharmony_ci
111762306a36Sopenharmony_ci        The requested maximum utilization (limit) as a percentage rational
111862306a36Sopenharmony_ci        number, e.g. 98.76 for 98.76%.
111962306a36Sopenharmony_ci
112062306a36Sopenharmony_ci        This interface allows reading and setting maximum utilization clamp
        values similar to sched_setattr(2). This maximum utilization
112262306a36Sopenharmony_ci        value is used to clamp the task specific maximum utilization clamp.
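
The capping of the requested minimum by the current maximum can be sketched as
follows.  The helper names are hypothetical, and treating "max" as 100% is an
assumption made for this example.

```python
def parse_uclamp(text):
    """Convert a cpu.uclamp.{min,max} value to a percentage as a float."""
    text = text.strip()
    # Assumption for this sketch: "max" stands for 100%.
    return 100.0 if text == "max" else float(text)


def effective_uclamp_min(uclamp_min, uclamp_max):
    """Return the requested minimum capped by the current maximum."""
    return min(parse_uclamp(uclamp_min), parse_uclamp(uclamp_max))
```

With cpu.uclamp.min set to "50.00" and cpu.uclamp.max set to "20.00", the
sketch yields an effective minimum of 20%, because the protection cannot
exceed the limit.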
112362306a36Sopenharmony_ci
112462306a36Sopenharmony_ci
112562306a36Sopenharmony_ci
112662306a36Sopenharmony_ciMemory
112762306a36Sopenharmony_ci------
112862306a36Sopenharmony_ci
112962306a36Sopenharmony_ciThe "memory" controller regulates distribution of memory.  Memory is
113062306a36Sopenharmony_cistateful and implements both limit and protection models.  Due to the
113162306a36Sopenharmony_ciintertwining between memory usage and reclaim pressure and the
113262306a36Sopenharmony_cistateful nature of memory, the distribution model is relatively
113362306a36Sopenharmony_cicomplex.
113462306a36Sopenharmony_ci
113562306a36Sopenharmony_ciWhile not completely water-tight, all major memory usages by a given
113662306a36Sopenharmony_cicgroup are tracked so that the total memory consumption can be
113762306a36Sopenharmony_ciaccounted and controlled to a reasonable extent.  Currently, the
113862306a36Sopenharmony_cifollowing types of memory usages are tracked.
113962306a36Sopenharmony_ci
114062306a36Sopenharmony_ci- Userland memory - page cache and anonymous memory.
114162306a36Sopenharmony_ci
114262306a36Sopenharmony_ci- Kernel data structures such as dentries and inodes.
114362306a36Sopenharmony_ci
114462306a36Sopenharmony_ci- TCP socket buffers.
114562306a36Sopenharmony_ci
114662306a36Sopenharmony_ciThe above list may expand in the future for better coverage.
114762306a36Sopenharmony_ci
114862306a36Sopenharmony_ci
114962306a36Sopenharmony_ciMemory Interface Files
115062306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~
115162306a36Sopenharmony_ci
115262306a36Sopenharmony_ciAll memory amounts are in bytes.  If a value which is not aligned to
115362306a36Sopenharmony_ciPAGE_SIZE is written, the value may be rounded up to the closest
115462306a36Sopenharmony_ciPAGE_SIZE multiple when read back.
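
The rounding described above amounts to the following arithmetic.  This is
only a sketch: round_up_to_page is a hypothetical helper and the default of
4096 bytes is an assumed page size, since the real PAGE_SIZE depends on the
architecture and kernel configuration.

```python
def round_up_to_page(nbytes, page_size=4096):
    """Round nbytes up to the closest multiple of page_size.

    page_size=4096 is only an assumed default; the real PAGE_SIZE depends on
    the architecture and kernel configuration.
    """
    # Ceiling division written with floor division on the negated value.
    return -(-nbytes // page_size) * page_size
```

Under that assumption, writing 5000 may read back as 8192, while a value that
is already page aligned is unchanged.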
115562306a36Sopenharmony_ci
115662306a36Sopenharmony_ci  memory.current
115762306a36Sopenharmony_ci	A read-only single value file which exists on non-root
115862306a36Sopenharmony_ci	cgroups.
115962306a36Sopenharmony_ci
116062306a36Sopenharmony_ci	The total amount of memory currently being used by the cgroup
116162306a36Sopenharmony_ci	and its descendants.
116262306a36Sopenharmony_ci
116362306a36Sopenharmony_ci  memory.min
116462306a36Sopenharmony_ci	A read-write single value file which exists on non-root
116562306a36Sopenharmony_ci	cgroups.  The default is "0".
116662306a36Sopenharmony_ci
116762306a36Sopenharmony_ci	Hard memory protection.  If the memory usage of a cgroup
116862306a36Sopenharmony_ci	is within its effective min boundary, the cgroup's memory
116962306a36Sopenharmony_ci	won't be reclaimed under any conditions. If there is no
117062306a36Sopenharmony_ci	unprotected reclaimable memory available, OOM killer
117162306a36Sopenharmony_ci	is invoked. Above the effective min boundary (or
117262306a36Sopenharmony_ci	effective low boundary if it is higher), pages are reclaimed
117362306a36Sopenharmony_ci	proportionally to the overage, reducing reclaim pressure for
117462306a36Sopenharmony_ci	smaller overages.
117562306a36Sopenharmony_ci
117662306a36Sopenharmony_ci	Effective min boundary is limited by memory.min values of
117762306a36Sopenharmony_ci	all ancestor cgroups. If there is memory.min overcommitment
117862306a36Sopenharmony_ci	(child cgroup or cgroups are requiring more protected memory
117962306a36Sopenharmony_ci	than parent will allow), then each child cgroup will get
118062306a36Sopenharmony_ci	the part of parent's protection proportional to its
118162306a36Sopenharmony_ci	actual memory usage below memory.min.
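
The proportional split under overcommitment can be approximated with the
sketch below.  It is a deliberately simplified model of the rule stated above,
not the kernel's actual calculation; the function name, the cgroup names and
the numbers are all hypothetical.

```python
def distribute_protection(parent_protection, children):
    """Split a parent's protection among children under overcommitment.

    children maps a (hypothetical) cgroup name to (memory_min, usage), both
    in bytes. Returns the effective protection of each child.
    """
    # Only usage that sits below the child's own memory.min is protected.
    claimed = {name: min(mem_min, usage)
               for name, (mem_min, usage) in children.items()}
    total = sum(claimed.values())
    if total <= parent_protection:
        # No overcommitment: every child keeps what it claims.
        return claimed
    # Overcommitted: scale each claim by its proportion of the total.
    return {name: parent_protection * claim // total
            for name, claim in claimed.items()}


effective = distribute_protection(200, {"a": (150, 100), "b": (300, 300)})
```

Here child "a" uses 100 of its 150 protected bytes and child "b" uses all 300,
so the two claim 400 in total against a parent protection of 200 and each
receives half of its claim.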
118262306a36Sopenharmony_ci
118362306a36Sopenharmony_ci	Putting more memory than generally available under this
118462306a36Sopenharmony_ci	protection is discouraged and may lead to constant OOMs.
118562306a36Sopenharmony_ci
118662306a36Sopenharmony_ci	If a memory cgroup is not populated with processes,
118762306a36Sopenharmony_ci	its memory.min is ignored.
118862306a36Sopenharmony_ci
118962306a36Sopenharmony_ci  memory.low
119062306a36Sopenharmony_ci	A read-write single value file which exists on non-root
119162306a36Sopenharmony_ci	cgroups.  The default is "0".
119262306a36Sopenharmony_ci
119362306a36Sopenharmony_ci	Best-effort memory protection.  If the memory usage of a
119462306a36Sopenharmony_ci	cgroup is within its effective low boundary, the cgroup's
119562306a36Sopenharmony_ci	memory won't be reclaimed unless there is no reclaimable
119662306a36Sopenharmony_ci	memory available in unprotected cgroups.
	Above the effective low boundary (or
119862306a36Sopenharmony_ci	effective min boundary if it is higher), pages are reclaimed
119962306a36Sopenharmony_ci	proportionally to the overage, reducing reclaim pressure for
120062306a36Sopenharmony_ci	smaller overages.
120162306a36Sopenharmony_ci
120262306a36Sopenharmony_ci	Effective low boundary is limited by memory.low values of
120362306a36Sopenharmony_ci	all ancestor cgroups. If there is memory.low overcommitment
120462306a36Sopenharmony_ci	(child cgroup or cgroups are requiring more protected memory
120562306a36Sopenharmony_ci	than parent will allow), then each child cgroup will get
120662306a36Sopenharmony_ci	the part of parent's protection proportional to its
120762306a36Sopenharmony_ci	actual memory usage below memory.low.
120862306a36Sopenharmony_ci
120962306a36Sopenharmony_ci	Putting more memory than generally available under this
121062306a36Sopenharmony_ci	protection is discouraged.
121162306a36Sopenharmony_ci
121262306a36Sopenharmony_ci  memory.high
121362306a36Sopenharmony_ci	A read-write single value file which exists on non-root
121462306a36Sopenharmony_ci	cgroups.  The default is "max".
121562306a36Sopenharmony_ci
121662306a36Sopenharmony_ci	Memory usage throttle limit.  If a cgroup's usage goes
121762306a36Sopenharmony_ci	over the high boundary, the processes of the cgroup are
121862306a36Sopenharmony_ci	throttled and put under heavy reclaim pressure.
121962306a36Sopenharmony_ci
122062306a36Sopenharmony_ci	Going over the high limit never invokes the OOM killer and
122162306a36Sopenharmony_ci	under extreme conditions the limit may be breached. The high
122262306a36Sopenharmony_ci	limit should be used in scenarios where an external process
122362306a36Sopenharmony_ci	monitors the limited cgroup to alleviate heavy reclaim
122462306a36Sopenharmony_ci	pressure.
122562306a36Sopenharmony_ci
122662306a36Sopenharmony_ci  memory.max
122762306a36Sopenharmony_ci	A read-write single value file which exists on non-root
122862306a36Sopenharmony_ci	cgroups.  The default is "max".
122962306a36Sopenharmony_ci
123062306a36Sopenharmony_ci	Memory usage hard limit.  This is the main mechanism to limit
123162306a36Sopenharmony_ci	memory usage of a cgroup.  If a cgroup's memory usage reaches
123262306a36Sopenharmony_ci	this limit and can't be reduced, the OOM killer is invoked in
123362306a36Sopenharmony_ci	the cgroup. Under certain circumstances, the usage may go
123462306a36Sopenharmony_ci	over the limit temporarily.
123562306a36Sopenharmony_ci
	In the default configuration, regular 0-order allocations always
	succeed unless the OOM killer chooses the current task as a victim.

	Some kinds of allocations don't invoke the OOM killer.
	The caller could retry them differently, return -ENOMEM to
	userspace, or silently ignore the failure in cases like disk readahead.
124262306a36Sopenharmony_ci
124362306a36Sopenharmony_ci  memory.reclaim
124462306a36Sopenharmony_ci	A write-only nested-keyed file which exists for all cgroups.
124562306a36Sopenharmony_ci
124662306a36Sopenharmony_ci	This is a simple interface to trigger memory reclaim in the
124762306a36Sopenharmony_ci	target cgroup.
124862306a36Sopenharmony_ci
124962306a36Sopenharmony_ci	This file accepts a single key, the number of bytes to reclaim.
125062306a36Sopenharmony_ci	No nested keys are currently supported.
125162306a36Sopenharmony_ci
125262306a36Sopenharmony_ci	Example::
125362306a36Sopenharmony_ci
125462306a36Sopenharmony_ci	  echo "1G" > memory.reclaim
125562306a36Sopenharmony_ci
125662306a36Sopenharmony_ci	The interface can be later extended with nested keys to
125762306a36Sopenharmony_ci	configure the reclaim behavior. For example, specify the
125862306a36Sopenharmony_ci	type of memory to reclaim from (anon, file, ..).
125962306a36Sopenharmony_ci
	Please note that the kernel can over or under reclaim from
	the target cgroup. If fewer bytes are reclaimed than the
	specified amount, -EAGAIN is returned.
126362306a36Sopenharmony_ci
	Please note that proactive reclaim (triggered by this
	interface) is not meant to indicate memory pressure on the
	memory cgroup. Therefore socket memory balancing, which is
	normally triggered by memory reclaim, is not exercised in this
	case.  This means that the networking layer will not adapt based
	on reclaim induced by memory.reclaim.
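
A caller of memory.reclaim has to be prepared for the EAGAIN case described
above.  The following Python sketch shows one way to do that; the function
name proactive_reclaim and its return convention are assumptions made for this
example.

```python
import errno
import os


def proactive_reclaim(reclaim_path, amount):
    """Ask the kernel to reclaim `amount` (e.g. "1G") from a cgroup.

    reclaim_path is the path to the cgroup's memory.reclaim file.
    Returns True if the write succeeded and False if the kernel reported
    EAGAIN, i.e. less memory was reclaimed than requested.
    """
    fd = os.open(reclaim_path, os.O_WRONLY)
    try:
        os.write(fd, amount.encode())
    except OSError as e:
        if e.errno == errno.EAGAIN:
            return False
        raise
    finally:
        os.close(fd)
    return True
```

A False result is not necessarily a failure: some memory may still have been
reclaimed, so a caller can decide whether to retry with a smaller amount.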
127062306a36Sopenharmony_ci
127162306a36Sopenharmony_ci  memory.peak
127262306a36Sopenharmony_ci	A read-only single value file which exists on non-root
127362306a36Sopenharmony_ci	cgroups.
127462306a36Sopenharmony_ci
127562306a36Sopenharmony_ci	The max memory usage recorded for the cgroup and its
127662306a36Sopenharmony_ci	descendants since the creation of the cgroup.
127762306a36Sopenharmony_ci
127862306a36Sopenharmony_ci  memory.oom.group
127962306a36Sopenharmony_ci	A read-write single value file which exists on non-root
128062306a36Sopenharmony_ci	cgroups.  The default value is "0".
128162306a36Sopenharmony_ci
128262306a36Sopenharmony_ci	Determines whether the cgroup should be treated as
128362306a36Sopenharmony_ci	an indivisible workload by the OOM killer. If set,
128462306a36Sopenharmony_ci	all tasks belonging to the cgroup or to its descendants
128562306a36Sopenharmony_ci	(if the memory cgroup is not a leaf cgroup) are killed
128662306a36Sopenharmony_ci	together or not at all. This can be used to avoid
128762306a36Sopenharmony_ci	partial kills to guarantee workload integrity.
128862306a36Sopenharmony_ci
128962306a36Sopenharmony_ci	Tasks with the OOM protection (oom_score_adj set to -1000)
129062306a36Sopenharmony_ci	are treated as an exception and are never killed.
129162306a36Sopenharmony_ci
	If the OOM killer is invoked in a cgroup, it's not going
	to kill any tasks outside of this cgroup, regardless of the
	memory.oom.group values of ancestor cgroups.
129562306a36Sopenharmony_ci
129662306a36Sopenharmony_ci  memory.events
129762306a36Sopenharmony_ci	A read-only flat-keyed file which exists on non-root cgroups.
129862306a36Sopenharmony_ci	The following entries are defined.  Unless specified
129962306a36Sopenharmony_ci	otherwise, a value change in this file generates a file
130062306a36Sopenharmony_ci	modified event.
130162306a36Sopenharmony_ci
130262306a36Sopenharmony_ci	Note that all fields in this file are hierarchical and the
130362306a36Sopenharmony_ci	file modified event can be generated due to an event down the
130462306a36Sopenharmony_ci	hierarchy. For the local events at the cgroup level see
130562306a36Sopenharmony_ci	memory.events.local.
130662306a36Sopenharmony_ci
130762306a36Sopenharmony_ci	  low
130862306a36Sopenharmony_ci		The number of times the cgroup is reclaimed due to
130962306a36Sopenharmony_ci		high memory pressure even though its usage is under
131062306a36Sopenharmony_ci		the low boundary.  This usually indicates that the low
131162306a36Sopenharmony_ci		boundary is over-committed.
131262306a36Sopenharmony_ci
131362306a36Sopenharmony_ci	  high
131462306a36Sopenharmony_ci		The number of times processes of the cgroup are
131562306a36Sopenharmony_ci		throttled and routed to perform direct memory reclaim
131662306a36Sopenharmony_ci		because the high memory boundary was exceeded.  For a
131762306a36Sopenharmony_ci		cgroup whose memory usage is capped by the high limit
131862306a36Sopenharmony_ci		rather than global memory pressure, this event's
131962306a36Sopenharmony_ci		occurrences are expected.
132062306a36Sopenharmony_ci
132162306a36Sopenharmony_ci	  max
132262306a36Sopenharmony_ci		The number of times the cgroup's memory usage was
132362306a36Sopenharmony_ci		about to go over the max boundary.  If direct reclaim
132462306a36Sopenharmony_ci		fails to bring it down, the cgroup goes to OOM state.
132562306a36Sopenharmony_ci
132662306a36Sopenharmony_ci	  oom
		The number of times the cgroup's memory usage
		reached the limit and an allocation was about to fail.
132962306a36Sopenharmony_ci
133062306a36Sopenharmony_ci		This event is not raised if the OOM killer is not
133162306a36Sopenharmony_ci		considered as an option, e.g. for failed high-order
133262306a36Sopenharmony_ci		allocations or if caller asked to not retry attempts.
133362306a36Sopenharmony_ci
133462306a36Sopenharmony_ci	  oom_kill
133562306a36Sopenharmony_ci		The number of processes belonging to this cgroup
133662306a36Sopenharmony_ci		killed by any kind of OOM killer.
133762306a36Sopenharmony_ci
	  oom_group_kill
		The number of times a group OOM has occurred.
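
Since a file modified event only says that some counter changed, a monitor
usually compares two snapshots of memory.events to find out which one.  The
sketch below shows that comparison; both helper names and the sample values
are hypothetical.

```python
def parse_events(text):
    """Parse memory.events content into a dict of counters."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


def changed_events(before, after):
    """Return the counters that increased between two snapshots."""
    return {key: after[key] - before.get(key, 0)
            for key in after
            if after[key] > before.get(key, 0)}


# Two made-up snapshots taken before and after a file modified event.
before = parse_events("low 0\nhigh 3\nmax 1\noom 0\noom_kill 0\noom_group_kill 0\n")
after = parse_events("low 0\nhigh 5\nmax 1\noom 1\noom_kill 1\noom_group_kill 0\n")
delta = changed_events(before, after)
```

Because the counters are hierarchical, an increase may stem from a descendant
cgroup rather than from the cgroup being watched.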
134062306a36Sopenharmony_ci
134162306a36Sopenharmony_ci  memory.events.local
134262306a36Sopenharmony_ci	Similar to memory.events but the fields in the file are local
134362306a36Sopenharmony_ci	to the cgroup i.e. not hierarchical. The file modified event
134462306a36Sopenharmony_ci	generated on this file reflects only the local events.
134562306a36Sopenharmony_ci
134662306a36Sopenharmony_ci  memory.stat
134762306a36Sopenharmony_ci	A read-only flat-keyed file which exists on non-root cgroups.
134862306a36Sopenharmony_ci
134962306a36Sopenharmony_ci	This breaks down the cgroup's memory footprint into different
135062306a36Sopenharmony_ci	types of memory, type-specific details, and other information
135162306a36Sopenharmony_ci	on the state and past events of the memory management system.
135262306a36Sopenharmony_ci
135362306a36Sopenharmony_ci	All memory amounts are in bytes.
135462306a36Sopenharmony_ci
135562306a36Sopenharmony_ci	The entries are ordered to be human readable, and new entries
135662306a36Sopenharmony_ci	can show up in the middle. Don't rely on items remaining in a
135762306a36Sopenharmony_ci	fixed position; use the keys to look up specific values!
135862306a36Sopenharmony_ci
	Entries that have no per-node counter, and therefore do not
	show up in memory.numa_stat, are tagged with 'npn'
	(non-per-node) below.
136262306a36Sopenharmony_ci
136362306a36Sopenharmony_ci	  anon
136462306a36Sopenharmony_ci		Amount of memory used in anonymous mappings such as
136562306a36Sopenharmony_ci		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
136662306a36Sopenharmony_ci
136762306a36Sopenharmony_ci	  file
136862306a36Sopenharmony_ci		Amount of memory used to cache filesystem data,
136962306a36Sopenharmony_ci		including tmpfs and shared memory.
137062306a36Sopenharmony_ci
137162306a36Sopenharmony_ci	  kernel (npn)
137262306a36Sopenharmony_ci		Amount of total kernel memory, including
137362306a36Sopenharmony_ci		(kernel_stack, pagetables, percpu, vmalloc, slab) in
137462306a36Sopenharmony_ci		addition to other kernel memory use cases.
137562306a36Sopenharmony_ci
137662306a36Sopenharmony_ci	  kernel_stack
137762306a36Sopenharmony_ci		Amount of memory allocated to kernel stacks.
137862306a36Sopenharmony_ci
137962306a36Sopenharmony_ci	  pagetables
		Amount of memory allocated for page tables.
138162306a36Sopenharmony_ci
138262306a36Sopenharmony_ci	  sec_pagetables
		Amount of memory allocated for secondary page tables.
		This currently includes KVM mmu allocations on x86
		and arm64.
138662306a36Sopenharmony_ci
138762306a36Sopenharmony_ci	  percpu (npn)
138862306a36Sopenharmony_ci		Amount of memory used for storing per-cpu kernel
138962306a36Sopenharmony_ci		data structures.
139062306a36Sopenharmony_ci
139162306a36Sopenharmony_ci	  sock (npn)
139262306a36Sopenharmony_ci		Amount of memory used in network transmission buffers
139362306a36Sopenharmony_ci
139462306a36Sopenharmony_ci	  vmalloc (npn)
139562306a36Sopenharmony_ci		Amount of memory used for vmap backed memory.
139662306a36Sopenharmony_ci
139762306a36Sopenharmony_ci	  shmem
139862306a36Sopenharmony_ci		Amount of cached filesystem data that is swap-backed,
139962306a36Sopenharmony_ci		such as tmpfs, shm segments, shared anonymous mmap()s
140062306a36Sopenharmony_ci
140162306a36Sopenharmony_ci	  zswap
140262306a36Sopenharmony_ci		Amount of memory consumed by the zswap compression backend.
140362306a36Sopenharmony_ci
140462306a36Sopenharmony_ci	  zswapped
140562306a36Sopenharmony_ci		Amount of application memory swapped out to zswap.
140662306a36Sopenharmony_ci
140762306a36Sopenharmony_ci	  file_mapped
140862306a36Sopenharmony_ci		Amount of cached filesystem data mapped with mmap()
140962306a36Sopenharmony_ci
141062306a36Sopenharmony_ci	  file_dirty
141162306a36Sopenharmony_ci		Amount of cached filesystem data that was modified but
141262306a36Sopenharmony_ci		not yet written back to disk
141362306a36Sopenharmony_ci
141462306a36Sopenharmony_ci	  file_writeback
141562306a36Sopenharmony_ci		Amount of cached filesystem data that was modified and
141662306a36Sopenharmony_ci		is currently being written back to disk
141762306a36Sopenharmony_ci
141862306a36Sopenharmony_ci	  swapcached
141962306a36Sopenharmony_ci		Amount of swap cached in memory. The swapcache is accounted
142062306a36Sopenharmony_ci		against both memory and swap usage.
142162306a36Sopenharmony_ci
142262306a36Sopenharmony_ci	  anon_thp
142362306a36Sopenharmony_ci		Amount of memory used in anonymous mappings backed by
142462306a36Sopenharmony_ci		transparent hugepages
142562306a36Sopenharmony_ci
142662306a36Sopenharmony_ci	  file_thp
142762306a36Sopenharmony_ci		Amount of cached filesystem data backed by transparent
142862306a36Sopenharmony_ci		hugepages
142962306a36Sopenharmony_ci
143062306a36Sopenharmony_ci	  shmem_thp
143162306a36Sopenharmony_ci		Amount of shm, tmpfs, shared anonymous mmap()s backed by
143262306a36Sopenharmony_ci		transparent hugepages
143362306a36Sopenharmony_ci
143462306a36Sopenharmony_ci	  inactive_anon, active_anon, inactive_file, active_file, unevictable
143562306a36Sopenharmony_ci		Amount of memory, swap-backed and filesystem-backed,
143662306a36Sopenharmony_ci		on the internal memory management lists used by the
143762306a36Sopenharmony_ci		page reclaim algorithm.
143862306a36Sopenharmony_ci
		As these represent internal list state (e.g. shmem pages are on
		anon memory management lists), inactive_foo + active_foo may not
		be equal to the value for the foo counter, since the foo counter
		is type-based, not list-based.
144362306a36Sopenharmony_ci
144462306a36Sopenharmony_ci	  slab_reclaimable
144562306a36Sopenharmony_ci		Part of "slab" that might be reclaimed, such as
144662306a36Sopenharmony_ci		dentries and inodes.
144762306a36Sopenharmony_ci
144862306a36Sopenharmony_ci	  slab_unreclaimable
144962306a36Sopenharmony_ci		Part of "slab" that cannot be reclaimed on memory
145062306a36Sopenharmony_ci		pressure.
145162306a36Sopenharmony_ci
145262306a36Sopenharmony_ci	  slab (npn)
145362306a36Sopenharmony_ci		Amount of memory used for storing in-kernel data
145462306a36Sopenharmony_ci		structures.
145562306a36Sopenharmony_ci
145662306a36Sopenharmony_ci	  workingset_refault_anon
145762306a36Sopenharmony_ci		Number of refaults of previously evicted anonymous pages.
145862306a36Sopenharmony_ci
145962306a36Sopenharmony_ci	  workingset_refault_file
146062306a36Sopenharmony_ci		Number of refaults of previously evicted file pages.
146162306a36Sopenharmony_ci
146262306a36Sopenharmony_ci	  workingset_activate_anon
146362306a36Sopenharmony_ci		Number of refaulted anonymous pages that were immediately
146462306a36Sopenharmony_ci		activated.
146562306a36Sopenharmony_ci
146662306a36Sopenharmony_ci	  workingset_activate_file
146762306a36Sopenharmony_ci		Number of refaulted file pages that were immediately activated.
146862306a36Sopenharmony_ci
146962306a36Sopenharmony_ci	  workingset_restore_anon
147062306a36Sopenharmony_ci		Number of restored anonymous pages which have been detected as
147162306a36Sopenharmony_ci		an active workingset before they got reclaimed.
147262306a36Sopenharmony_ci
147362306a36Sopenharmony_ci	  workingset_restore_file
147462306a36Sopenharmony_ci		Number of restored file pages which have been detected as an
147562306a36Sopenharmony_ci		active workingset before they got reclaimed.
147662306a36Sopenharmony_ci
147762306a36Sopenharmony_ci	  workingset_nodereclaim
147862306a36Sopenharmony_ci		Number of times a shadow node has been reclaimed
147962306a36Sopenharmony_ci
148062306a36Sopenharmony_ci	  pgscan (npn)
148162306a36Sopenharmony_ci		Amount of scanned pages (in an inactive LRU list)
148262306a36Sopenharmony_ci
148362306a36Sopenharmony_ci	  pgsteal (npn)
148462306a36Sopenharmony_ci		Amount of reclaimed pages
148562306a36Sopenharmony_ci
148662306a36Sopenharmony_ci	  pgscan_kswapd (npn)
148762306a36Sopenharmony_ci		Amount of scanned pages by kswapd (in an inactive LRU list)
148862306a36Sopenharmony_ci
148962306a36Sopenharmony_ci	  pgscan_direct (npn)
		Amount of pages scanned directly (in an inactive LRU list)
149162306a36Sopenharmony_ci
149262306a36Sopenharmony_ci	  pgscan_khugepaged (npn)
		Amount of scanned pages by khugepaged (in an inactive LRU list)
149462306a36Sopenharmony_ci
149562306a36Sopenharmony_ci	  pgsteal_kswapd (npn)
149662306a36Sopenharmony_ci		Amount of reclaimed pages by kswapd
149762306a36Sopenharmony_ci
149862306a36Sopenharmony_ci	  pgsteal_direct (npn)
		Amount of pages reclaimed directly
150062306a36Sopenharmony_ci
150162306a36Sopenharmony_ci	  pgsteal_khugepaged (npn)
150262306a36Sopenharmony_ci		Amount of reclaimed pages by khugepaged
150362306a36Sopenharmony_ci
150462306a36Sopenharmony_ci	  pgfault (npn)
150562306a36Sopenharmony_ci		Total number of page faults incurred
150662306a36Sopenharmony_ci
150762306a36Sopenharmony_ci	  pgmajfault (npn)
150862306a36Sopenharmony_ci		Number of major page faults incurred
150962306a36Sopenharmony_ci
151062306a36Sopenharmony_ci	  pgrefill (npn)
151162306a36Sopenharmony_ci		Amount of scanned pages (in an active LRU list)
151262306a36Sopenharmony_ci
151362306a36Sopenharmony_ci	  pgactivate (npn)
151462306a36Sopenharmony_ci		Amount of pages moved to the active LRU list
151562306a36Sopenharmony_ci
151662306a36Sopenharmony_ci	  pgdeactivate (npn)
151762306a36Sopenharmony_ci		Amount of pages moved to the inactive LRU list
151862306a36Sopenharmony_ci
151962306a36Sopenharmony_ci	  pglazyfree (npn)
152062306a36Sopenharmony_ci		Amount of pages postponed to be freed under memory pressure
152162306a36Sopenharmony_ci
152262306a36Sopenharmony_ci	  pglazyfreed (npn)
152362306a36Sopenharmony_ci		Amount of reclaimed lazyfree pages
152462306a36Sopenharmony_ci
152562306a36Sopenharmony_ci	  thp_fault_alloc (npn)
152662306a36Sopenharmony_ci		Number of transparent hugepages which were allocated to satisfy
		a page fault. This counter is not present when
		CONFIG_TRANSPARENT_HUGEPAGE is not set.
152962306a36Sopenharmony_ci
153062306a36Sopenharmony_ci	  thp_collapse_alloc (npn)
153162306a36Sopenharmony_ci		Number of transparent hugepages which were allocated to allow
153262306a36Sopenharmony_ci		collapsing an existing range of pages. This counter is not
153362306a36Sopenharmony_ci		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
153462306a36Sopenharmony_ci
153562306a36Sopenharmony_ci  memory.numa_stat
153662306a36Sopenharmony_ci	A read-only nested-keyed file which exists on non-root cgroups.
153762306a36Sopenharmony_ci
153862306a36Sopenharmony_ci	This breaks down the cgroup's memory footprint into different
153962306a36Sopenharmony_ci	types of memory, type-specific details, and other information
154062306a36Sopenharmony_ci	per node on the state of the memory management system.
154162306a36Sopenharmony_ci
154262306a36Sopenharmony_ci	This is useful for providing visibility into the NUMA locality
	information within a memcg since the pages are allowed to be
	allocated from any physical node. One of the use cases is evaluating
154562306a36Sopenharmony_ci	application performance by combining this information with the
154662306a36Sopenharmony_ci	application's CPU allocation.
154762306a36Sopenharmony_ci
154862306a36Sopenharmony_ci	All memory amounts are in bytes.
154962306a36Sopenharmony_ci
155062306a36Sopenharmony_ci	The output format of memory.numa_stat is::
155162306a36Sopenharmony_ci
155262306a36Sopenharmony_ci	  type N0=<bytes in node 0> N1=<bytes in node 1> ...
155362306a36Sopenharmony_ci
155462306a36Sopenharmony_ci	The entries are ordered to be human readable, and new entries
155562306a36Sopenharmony_ci	can show up in the middle. Don't rely on items remaining in a
155662306a36Sopenharmony_ci	fixed position; use the keys to look up specific values!
155762306a36Sopenharmony_ci
	For the meaning of each entry, refer to memory.stat.
155962306a36Sopenharmony_ci
156062306a36Sopenharmony_ci  memory.swap.current
156162306a36Sopenharmony_ci	A read-only single value file which exists on non-root
156262306a36Sopenharmony_ci	cgroups.
156362306a36Sopenharmony_ci
156462306a36Sopenharmony_ci	The total amount of swap currently being used by the cgroup
156562306a36Sopenharmony_ci	and its descendants.
156662306a36Sopenharmony_ci
156762306a36Sopenharmony_ci  memory.swap.high
156862306a36Sopenharmony_ci	A read-write single value file which exists on non-root
156962306a36Sopenharmony_ci	cgroups.  The default is "max".
157062306a36Sopenharmony_ci
157162306a36Sopenharmony_ci	Swap usage throttle limit.  If a cgroup's swap usage exceeds
157262306a36Sopenharmony_ci	this limit, all its further allocations will be throttled to
157362306a36Sopenharmony_ci	allow userspace to implement custom out-of-memory procedures.
157462306a36Sopenharmony_ci
157562306a36Sopenharmony_ci	This limit marks a point of no return for the cgroup. It is NOT
157662306a36Sopenharmony_ci	designed to manage the amount of swapping a workload does
157762306a36Sopenharmony_ci	during regular operation. Compare to memory.swap.max, which
157862306a36Sopenharmony_ci	prohibits swapping past a set amount, but lets the cgroup
157962306a36Sopenharmony_ci	continue unimpeded as long as other memory can be reclaimed.
158062306a36Sopenharmony_ci
158162306a36Sopenharmony_ci	Healthy workloads are not expected to reach this limit.
158262306a36Sopenharmony_ci
158362306a36Sopenharmony_ci  memory.swap.peak
158462306a36Sopenharmony_ci	A read-only single value file which exists on non-root
158562306a36Sopenharmony_ci	cgroups.
158662306a36Sopenharmony_ci
158762306a36Sopenharmony_ci	The max swap usage recorded for the cgroup and its
158862306a36Sopenharmony_ci	descendants since the creation of the cgroup.
158962306a36Sopenharmony_ci
159062306a36Sopenharmony_ci  memory.swap.max
159162306a36Sopenharmony_ci	A read-write single value file which exists on non-root
159262306a36Sopenharmony_ci	cgroups.  The default is "max".
159362306a36Sopenharmony_ci
159462306a36Sopenharmony_ci	Swap usage hard limit.  If a cgroup's swap usage reaches this
159562306a36Sopenharmony_ci	limit, anonymous memory of the cgroup will not be swapped out.
159662306a36Sopenharmony_ci
159762306a36Sopenharmony_ci  memory.swap.events
159862306a36Sopenharmony_ci	A read-only flat-keyed file which exists on non-root cgroups.
159962306a36Sopenharmony_ci	The following entries are defined.  Unless specified
160062306a36Sopenharmony_ci	otherwise, a value change in this file generates a file
160162306a36Sopenharmony_ci	modified event.
160262306a36Sopenharmony_ci
160362306a36Sopenharmony_ci	  high
160462306a36Sopenharmony_ci		The number of times the cgroup's swap usage was over
160562306a36Sopenharmony_ci		the high threshold.
160662306a36Sopenharmony_ci
160762306a36Sopenharmony_ci	  max
160862306a36Sopenharmony_ci		The number of times the cgroup's swap usage was about
160962306a36Sopenharmony_ci		to go over the max boundary and swap allocation
161062306a36Sopenharmony_ci		failed.
161162306a36Sopenharmony_ci
161262306a36Sopenharmony_ci	  fail
161362306a36Sopenharmony_ci		The number of times swap allocation failed either
161462306a36Sopenharmony_ci		because of running out of swap system-wide or max
161562306a36Sopenharmony_ci		limit.
161662306a36Sopenharmony_ci
	When memory.swap.max is reduced under the current usage, the
	existing swap entries are reclaimed gradually and the swap
	usage may stay higher than the limit for an extended period of
	time.  This reduces the impact on the workload and memory
	management.
162162306a36Sopenharmony_ci
162262306a36Sopenharmony_ci  memory.zswap.current
162362306a36Sopenharmony_ci	A read-only single value file which exists on non-root
162462306a36Sopenharmony_ci	cgroups.
162562306a36Sopenharmony_ci
162662306a36Sopenharmony_ci	The total amount of memory consumed by the zswap compression
162762306a36Sopenharmony_ci	backend.
162862306a36Sopenharmony_ci
162962306a36Sopenharmony_ci  memory.zswap.max
163062306a36Sopenharmony_ci	A read-write single value file which exists on non-root
163162306a36Sopenharmony_ci	cgroups.  The default is "max".
163262306a36Sopenharmony_ci
163362306a36Sopenharmony_ci	Zswap usage hard limit. If a cgroup's zswap pool reaches this
163462306a36Sopenharmony_ci	limit, it will refuse to take any more stores before existing
163562306a36Sopenharmony_ci	entries fault back in or are written out to disk.
163662306a36Sopenharmony_ci
163762306a36Sopenharmony_ci  memory.pressure
163862306a36Sopenharmony_ci	A read-only nested-keyed file.
163962306a36Sopenharmony_ci
164062306a36Sopenharmony_ci	Shows pressure stall information for memory. See
164162306a36Sopenharmony_ci	:ref:`Documentation/accounting/psi.rst <psi>` for details.
164262306a36Sopenharmony_ci
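Because entries in memory.numa_stat should be looked up by key rather
than by position, a consumer would typically parse the file into a
mapping first.  The following is a minimal Python sketch, assuming the
"type N0=<bytes> N1=<bytes> ..." format described for memory.numa_stat;
the function name and the sample values are hypothetical.

```python
def parse_numa_stat(text):
    """Parse memory.numa_stat content into {type: {node_id: bytes}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        per_node = {}
        for pair in fields[1:]:
            node, _, value = pair.partition("=")
            # Node keys look like "N0", "N1", ...; strip the leading "N".
            per_node[int(node[1:])] = int(value)
        stats[fields[0]] = per_node
    return stats


# Hypothetical sample; real content comes from reading memory.numa_stat.
SAMPLE = "anon N0=4096 N1=8192\nfile N0=12288 N1=0\n"
stats = parse_numa_stat(SAMPLE)

# Look values up by key, never by line position.
anon_on_node1 = stats["anon"][1]
```

Keying the result by entry name keeps the consumer working when new
entries show up in the middle of the file.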
164362306a36Sopenharmony_ci
164462306a36Sopenharmony_ciUsage Guidelines
164562306a36Sopenharmony_ci~~~~~~~~~~~~~~~~
164662306a36Sopenharmony_ci
164762306a36Sopenharmony_ci"memory.high" is the main mechanism to control memory usage.
164862306a36Sopenharmony_ciOver-committing on high limit (sum of high limits > available memory)
and letting global memory pressure distribute memory according to
165062306a36Sopenharmony_ciusage is a viable strategy.
165162306a36Sopenharmony_ci
165262306a36Sopenharmony_ciBecause breach of the high limit doesn't trigger the OOM killer but
165362306a36Sopenharmony_cithrottles the offending cgroup, a management agent has ample
165462306a36Sopenharmony_ciopportunities to monitor and take appropriate actions such as granting
165562306a36Sopenharmony_cimore memory or terminating the workload.
165662306a36Sopenharmony_ci
165762306a36Sopenharmony_ciDetermining whether a cgroup has enough memory is not trivial as
165862306a36Sopenharmony_cimemory usage doesn't indicate whether the workload can benefit from
165962306a36Sopenharmony_cimore memory.  For example, a workload which writes data received from
network to a file can use all available memory but can also perform
just as well with a small amount of memory.  A measure of memory
166262306a36Sopenharmony_cipressure - how much the workload is being impacted due to lack of
166362306a36Sopenharmony_cimemory - is necessary to determine whether a workload needs more
memory; the pressure stall information reported in "memory.pressure"
provides such a measure.
166662306a36Sopenharmony_ci
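The "memory.pressure" file listed under Memory Interface Files reports
pressure stall information, which a management agent could use as one
input when deciding whether a cgroup needs more memory.  The following
is a minimal Python sketch, assuming the line format documented in
Documentation/accounting/psi.rst ("some" and "full" lines carrying
avg10, avg60, avg300 and total fields).  The function names and the
10% threshold are hypothetical.

```python
def parse_pressure(text):
    """Parse PSI content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        values = {}
        for pair in fields[1:]:
            key, _, value = pair.partition("=")
            values[key] = float(value)
        result[fields[0]] = values
    return result


def needs_attention(pressure, threshold_pct=10.0):
    """True if tasks stalled on memory for more than threshold_pct of
    the last 10 seconds (the threshold is an arbitrary example value)."""
    return pressure["some"]["avg10"] > threshold_pct


# Hypothetical sample; real content comes from reading memory.pressure.
SAMPLE = (
    "some avg10=12.50 avg60=3.00 avg300=1.00 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
pressure = parse_pressure(SAMPLE)
```

What the agent does once the threshold is crossed, such as raising
"memory.high" or terminating the workload, is a policy decision outside
the scope of this sketch.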
166762306a36Sopenharmony_ci
166862306a36Sopenharmony_ciMemory Ownership
166962306a36Sopenharmony_ci~~~~~~~~~~~~~~~~
167062306a36Sopenharmony_ci
167162306a36Sopenharmony_ciA memory area is charged to the cgroup which instantiated it and stays
167262306a36Sopenharmony_cicharged to the cgroup until the area is released.  Migrating a process
167362306a36Sopenharmony_cito a different cgroup doesn't move the memory usages that it
167462306a36Sopenharmony_ciinstantiated while in the previous cgroup to the new cgroup.
167562306a36Sopenharmony_ci
167662306a36Sopenharmony_ciA memory area may be used by processes belonging to different cgroups.
To which cgroup the area will be charged is non-deterministic; however,
167862306a36Sopenharmony_ciover time, the memory area is likely to end up in a cgroup which has
167962306a36Sopenharmony_cienough memory allowance to avoid high reclaim pressure.
168062306a36Sopenharmony_ci
168162306a36Sopenharmony_ciIf a cgroup sweeps a considerable amount of memory which is expected
168262306a36Sopenharmony_cito be accessed repeatedly by other cgroups, it may make sense to use
168362306a36Sopenharmony_ciPOSIX_FADV_DONTNEED to relinquish the ownership of memory areas
168462306a36Sopenharmony_cibelonging to the affected files to ensure correct memory ownership.
168562306a36Sopenharmony_ci
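Relinquishing ownership with POSIX_FADV_DONTNEED can be sketched in
Python through os.posix_fadvise().  This is only an illustration: the
helper name is hypothetical, and pages that are still dirty are not
expected to be dropped, so a caller may want to write the file back
first.

```python
import os


def drop_page_cache(path):
    """Advise the kernel that cached pages of `path` are no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

A cgroup that has just scanned a large set of files would call the
helper on each of them, so that later accesses from other cgroups
instantiate, and are charged for, the pages they actually use.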
168662306a36Sopenharmony_ci
168762306a36Sopenharmony_ciIO
168862306a36Sopenharmony_ci--
168962306a36Sopenharmony_ci
169062306a36Sopenharmony_ciThe "io" controller regulates the distribution of IO resources.  This
169162306a36Sopenharmony_cicontroller implements both weight based and absolute bandwidth or IOPS
169262306a36Sopenharmony_cilimit distribution; however, weight based distribution is available
169362306a36Sopenharmony_cionly if cfq-iosched is in use and neither scheme is available for
169462306a36Sopenharmony_ciblk-mq devices.
169562306a36Sopenharmony_ci
169662306a36Sopenharmony_ci
169762306a36Sopenharmony_ciIO Interface Files
169862306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~
169962306a36Sopenharmony_ci
170062306a36Sopenharmony_ci  io.stat
170162306a36Sopenharmony_ci	A read-only nested-keyed file.
170262306a36Sopenharmony_ci
170362306a36Sopenharmony_ci	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
170462306a36Sopenharmony_ci	The following nested keys are defined.
170562306a36Sopenharmony_ci
170662306a36Sopenharmony_ci	  ======	=====================
170762306a36Sopenharmony_ci	  rbytes	Bytes read
170862306a36Sopenharmony_ci	  wbytes	Bytes written
170962306a36Sopenharmony_ci	  rios		Number of read IOs
171062306a36Sopenharmony_ci	  wios		Number of write IOs
171162306a36Sopenharmony_ci	  dbytes	Bytes discarded
171262306a36Sopenharmony_ci	  dios		Number of discard IOs
171362306a36Sopenharmony_ci	  ======	=====================
171462306a36Sopenharmony_ci
171562306a36Sopenharmony_ci	An example read output follows::
171662306a36Sopenharmony_ci
171762306a36Sopenharmony_ci	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
171862306a36Sopenharmony_ci	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
171962306a36Sopenharmony_ci
172062306a36Sopenharmony_ci  io.cost.qos
172162306a36Sopenharmony_ci	A read-write nested-keyed file which exists only on the root
172262306a36Sopenharmony_ci	cgroup.
172362306a36Sopenharmony_ci
172462306a36Sopenharmony_ci	This file configures the Quality of Service of the IO cost
172562306a36Sopenharmony_ci	model based controller (CONFIG_BLK_CGROUP_IOCOST) which
172662306a36Sopenharmony_ci	currently implements "io.weight" proportional control.  Lines
172762306a36Sopenharmony_ci	are keyed by $MAJ:$MIN device numbers and not ordered.  The
172862306a36Sopenharmony_ci	line for a given device is populated on the first write for
172962306a36Sopenharmony_ci	the device on "io.cost.qos" or "io.cost.model".  The following
173062306a36Sopenharmony_ci	nested keys are defined.
173162306a36Sopenharmony_ci
173262306a36Sopenharmony_ci	  ======	=====================================
173362306a36Sopenharmony_ci	  enable	Weight-based control enable
173462306a36Sopenharmony_ci	  ctrl		"auto" or "user"
173562306a36Sopenharmony_ci	  rpct		Read latency percentile    [0, 100]
173662306a36Sopenharmony_ci	  rlat		Read latency threshold
173762306a36Sopenharmony_ci	  wpct		Write latency percentile   [0, 100]
173862306a36Sopenharmony_ci	  wlat		Write latency threshold
173962306a36Sopenharmony_ci	  min		Minimum scaling percentage [1, 10000]
174062306a36Sopenharmony_ci	  max		Maximum scaling percentage [1, 10000]
174162306a36Sopenharmony_ci	  ======	=====================================
174262306a36Sopenharmony_ci
174362306a36Sopenharmony_ci	The controller is disabled by default and can be enabled by
174462306a36Sopenharmony_ci	setting "enable" to 1.  "rpct" and "wpct" parameters default
174562306a36Sopenharmony_ci	to zero and the controller uses internal device saturation
174662306a36Sopenharmony_ci	state to adjust the overall IO rate between "min" and "max".
174762306a36Sopenharmony_ci
174862306a36Sopenharmony_ci	When a better control quality is needed, latency QoS
174962306a36Sopenharmony_ci	parameters can be configured.  For example::
175062306a36Sopenharmony_ci
175162306a36Sopenharmony_ci	  8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
175262306a36Sopenharmony_ci
175362306a36Sopenharmony_ci	shows that on sdb, the controller is enabled, will consider
175462306a36Sopenharmony_ci	the device saturated if the 95th percentile of read completion
175562306a36Sopenharmony_ci	latencies is above 75ms or write 150ms, and adjust the overall
175662306a36Sopenharmony_ci	IO issue rate between 50% and 150% accordingly.
175762306a36Sopenharmony_ci
175862306a36Sopenharmony_ci	The lower the saturation point, the better the latency QoS at
175962306a36Sopenharmony_ci	the cost of aggregate bandwidth.  The narrower the allowed
176062306a36Sopenharmony_ci	adjustment range between "min" and "max", the more conformant
176162306a36Sopenharmony_ci	to the cost model the IO behavior.  Note that the IO issue
176262306a36Sopenharmony_ci	base rate may be far off from 100% and setting "min" and "max"
176362306a36Sopenharmony_ci	blindly can lead to a significant loss of device capacity or
176462306a36Sopenharmony_ci	control quality.  "min" and "max" are useful for regulating
	devices which show wide temporary behavior changes - e.g. an
	SSD which accepts writes at the line speed for a while and
	then completely stalls for multiple seconds.
176862306a36Sopenharmony_ci
176962306a36Sopenharmony_ci	When "ctrl" is "auto", the parameters are controlled by the
177062306a36Sopenharmony_ci	kernel and may change automatically.  Setting "ctrl" to "user"
177162306a36Sopenharmony_ci	or setting any of the percentile and latency parameters puts
177262306a36Sopenharmony_ci	it into "user" mode and disables the automatic changes.  The
177362306a36Sopenharmony_ci	automatic mode can be restored by setting "ctrl" to "auto".
177462306a36Sopenharmony_ci
177562306a36Sopenharmony_ci  io.cost.model
177662306a36Sopenharmony_ci	A read-write nested-keyed file which exists only on the root
177762306a36Sopenharmony_ci	cgroup.
177862306a36Sopenharmony_ci
177962306a36Sopenharmony_ci	This file configures the cost model of the IO cost model based
178062306a36Sopenharmony_ci	controller (CONFIG_BLK_CGROUP_IOCOST) which currently
178162306a36Sopenharmony_ci	implements "io.weight" proportional control.  Lines are keyed
178262306a36Sopenharmony_ci	by $MAJ:$MIN device numbers and not ordered.  The line for a
178362306a36Sopenharmony_ci	given device is populated on the first write for the device on
178462306a36Sopenharmony_ci	"io.cost.qos" or "io.cost.model".  The following nested keys
178562306a36Sopenharmony_ci	are defined.
178662306a36Sopenharmony_ci
178762306a36Sopenharmony_ci	  =====		================================
178862306a36Sopenharmony_ci	  ctrl		"auto" or "user"
178962306a36Sopenharmony_ci	  model		The cost model in use - "linear"
179062306a36Sopenharmony_ci	  =====		================================
179162306a36Sopenharmony_ci
179262306a36Sopenharmony_ci	When "ctrl" is "auto", the kernel may change all parameters
179362306a36Sopenharmony_ci	dynamically.  When "ctrl" is set to "user" or any other
	parameters are written to, "ctrl" becomes "user" and the
179562306a36Sopenharmony_ci	automatic changes are disabled.
179662306a36Sopenharmony_ci
179762306a36Sopenharmony_ci	When "model" is "linear", the following model parameters are
179862306a36Sopenharmony_ci	defined.
179962306a36Sopenharmony_ci
180062306a36Sopenharmony_ci	  =============	========================================
180162306a36Sopenharmony_ci	  [r|w]bps	The maximum sequential IO throughput
180262306a36Sopenharmony_ci	  [r|w]seqiops	The maximum 4k sequential IOs per second
180362306a36Sopenharmony_ci	  [r|w]randiops	The maximum 4k random IOs per second
180462306a36Sopenharmony_ci	  =============	========================================
180562306a36Sopenharmony_ci
180662306a36Sopenharmony_ci	From the above, the builtin linear model determines the base
180762306a36Sopenharmony_ci	costs of a sequential and random IO and the cost coefficient
180862306a36Sopenharmony_ci	for the IO size.  While simple, this model can cover most
180962306a36Sopenharmony_ci	common device classes acceptably.
181062306a36Sopenharmony_ci
	The IO cost model isn't expected to be accurate in an absolute
	sense and is scaled to the device behavior dynamically.
181362306a36Sopenharmony_ci
181462306a36Sopenharmony_ci	If needed, tools/cgroup/iocost_coef_gen.py can be used to
181562306a36Sopenharmony_ci	generate device-specific coefficients.
181662306a36Sopenharmony_ci
181762306a36Sopenharmony_ci  io.weight
181862306a36Sopenharmony_ci	A read-write flat-keyed file which exists on non-root cgroups.
181962306a36Sopenharmony_ci	The default is "default 100".
182062306a36Sopenharmony_ci
182162306a36Sopenharmony_ci	The first line is the default weight applied to devices
182262306a36Sopenharmony_ci	without specific override.  The rest are overrides keyed by
182362306a36Sopenharmony_ci	$MAJ:$MIN device numbers and not ordered.  The weights are in
	the range [1, 10000] and specify the relative amount of IO time
182562306a36Sopenharmony_ci	the cgroup can use in relation to its siblings.
182662306a36Sopenharmony_ci
182762306a36Sopenharmony_ci	The default weight can be updated by writing either "default
182862306a36Sopenharmony_ci	$WEIGHT" or simply "$WEIGHT".  Overrides can be set by writing
182962306a36Sopenharmony_ci	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
183062306a36Sopenharmony_ci
183162306a36Sopenharmony_ci	An example read output follows::
183262306a36Sopenharmony_ci
183362306a36Sopenharmony_ci	  default 100
183462306a36Sopenharmony_ci	  8:16 200
183562306a36Sopenharmony_ci	  8:0 50
183662306a36Sopenharmony_ci
183762306a36Sopenharmony_ci  io.max
183862306a36Sopenharmony_ci	A read-write nested-keyed file which exists on non-root
183962306a36Sopenharmony_ci	cgroups.
184062306a36Sopenharmony_ci
184162306a36Sopenharmony_ci	BPS and IOPS based IO limit.  Lines are keyed by $MAJ:$MIN
184262306a36Sopenharmony_ci	device numbers and not ordered.  The following nested keys are
184362306a36Sopenharmony_ci	defined.
184462306a36Sopenharmony_ci
184562306a36Sopenharmony_ci	  =====		==================================
184662306a36Sopenharmony_ci	  rbps		Max read bytes per second
184762306a36Sopenharmony_ci	  wbps		Max write bytes per second
184862306a36Sopenharmony_ci	  riops		Max read IO operations per second
184962306a36Sopenharmony_ci	  wiops		Max write IO operations per second
185062306a36Sopenharmony_ci	  =====		==================================
185162306a36Sopenharmony_ci
185262306a36Sopenharmony_ci	When writing, any number of nested key-value pairs can be
185362306a36Sopenharmony_ci	specified in any order.  "max" can be specified as the value
185462306a36Sopenharmony_ci	to remove a specific limit.  If the same key is specified
185562306a36Sopenharmony_ci	multiple times, the outcome is undefined.
185662306a36Sopenharmony_ci
185762306a36Sopenharmony_ci	BPS and IOPS are measured in each IO direction and IOs are
	delayed if the limit is reached.  Temporary bursts are allowed.
185962306a36Sopenharmony_ci
186062306a36Sopenharmony_ci	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
186162306a36Sopenharmony_ci
186262306a36Sopenharmony_ci	  echo "8:16 rbps=2097152 wiops=120" > io.max
186362306a36Sopenharmony_ci
186462306a36Sopenharmony_ci	Reading returns the following::
186562306a36Sopenharmony_ci
186662306a36Sopenharmony_ci	  8:16 rbps=2097152 wbps=max riops=max wiops=120
186762306a36Sopenharmony_ci
186862306a36Sopenharmony_ci	Write IOPS limit can be removed by writing the following::
186962306a36Sopenharmony_ci
187062306a36Sopenharmony_ci	  echo "8:16 wiops=max" > io.max
187162306a36Sopenharmony_ci
187262306a36Sopenharmony_ci	Reading now returns the following::
187362306a36Sopenharmony_ci
187462306a36Sopenharmony_ci	  8:16 rbps=2097152 wbps=max riops=max wiops=max
187562306a36Sopenharmony_ci
187662306a36Sopenharmony_ci  io.pressure
187762306a36Sopenharmony_ci	A read-only nested-keyed file.
187862306a36Sopenharmony_ci
187962306a36Sopenharmony_ci	Shows pressure stall information for IO. See
188062306a36Sopenharmony_ci	:ref:`Documentation/accounting/psi.rst <psi>` for details.
188162306a36Sopenharmony_ci
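The nested-keyed files in this section share one line format: a
$MAJ:$MIN device key followed by key=value pairs.  The following is a
minimal Python sketch of reading such a file and of building a line
for "io.max".  The helper names are hypothetical; the sample data is
the io.stat example output shown earlier in this section.

```python
def parse_nested_keyed(text):
    """Parse nested-keyed content into {"MAJ:MIN": {key: value_string}}."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        devices[fields[0]] = dict(pair.split("=", 1) for pair in fields[1:])
    return devices


def format_io_max(device, **limits):
    """Build a line for io.max; a value of None is written as "max"."""
    allowed = ("rbps", "wbps", "riops", "wiops")
    parts = [device]
    for key, value in limits.items():
        if key not in allowed:
            raise ValueError("unknown io.max key: " + key)
        parts.append(key + "=" + ("max" if value is None else str(int(value))))
    return " ".join(parts)


IO_STAT = (
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0\n"
    "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 "
    "dbytes=50331648 dios=3021\n"
)
stat = parse_nested_keyed(IO_STAT)
written_8_16 = int(stat["8:16"]["wbytes"])

# Read limit of 2M BPS and write limit of 120 IOPS for 8:16.
line = format_io_max("8:16", rbps=2 * 1024 * 1024, wiops=120)
```

Passing the limits as keyword arguments means each key can appear at
most once, which sidesteps the undefined outcome of repeating a key in
a single write.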
188262306a36Sopenharmony_ci
188362306a36Sopenharmony_ciWriteback
188462306a36Sopenharmony_ci~~~~~~~~~
188562306a36Sopenharmony_ci
188662306a36Sopenharmony_ciPage cache is dirtied through buffered writes and shared mmaps and
188762306a36Sopenharmony_ciwritten asynchronously to the backing filesystem by the writeback
188862306a36Sopenharmony_cimechanism.  Writeback sits between the memory and IO domains and
188962306a36Sopenharmony_ciregulates the proportion of dirty memory by balancing dirtying and
189062306a36Sopenharmony_ciwrite IOs.
189162306a36Sopenharmony_ci
189262306a36Sopenharmony_ciThe io controller, in conjunction with the memory controller,
189362306a36Sopenharmony_ciimplements control of page cache writeback IOs.  The memory controller
defines the memory domain for which the dirty memory ratio is
calculated and maintained, and the io controller defines the io domain
which writes out dirty pages for the memory domain.  Both system-wide and
189762306a36Sopenharmony_ciper-cgroup dirty memory states are examined and the more restrictive
189862306a36Sopenharmony_ciof the two is enforced.
189962306a36Sopenharmony_ci
190062306a36Sopenharmony_cicgroup writeback requires explicit support from the underlying
190162306a36Sopenharmony_cifilesystem.  Currently, cgroup writeback is implemented on ext2, ext4,
btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are
190362306a36Sopenharmony_ciattributed to the root cgroup.
190462306a36Sopenharmony_ci
There are inherent differences in memory and writeback management
which affect how cgroup ownership is tracked.  Memory is tracked per
page while writeback is tracked per inode.  For the purpose of
writeback, an inode is assigned to a cgroup and all IO requests to
write dirty pages from the inode are attributed to that cgroup.
191062306a36Sopenharmony_ci
191162306a36Sopenharmony_ciAs cgroup ownership for memory is tracked per page, there can be pages
191262306a36Sopenharmony_ciwhich are associated with different cgroups than the one the inode is
191362306a36Sopenharmony_ciassociated with.  These are called foreign pages.  The writeback
191462306a36Sopenharmony_ciconstantly keeps track of foreign pages and, if a particular foreign
191562306a36Sopenharmony_cicgroup becomes the majority over a certain period of time, switches
191662306a36Sopenharmony_cithe ownership of the inode to that cgroup.
191762306a36Sopenharmony_ci
191862306a36Sopenharmony_ciWhile this model is enough for most use cases where a given inode is
191962306a36Sopenharmony_cimostly dirtied by a single cgroup even when the main writing cgroup
192062306a36Sopenharmony_cichanges over time, use cases where multiple cgroups write to a single
192162306a36Sopenharmony_ciinode simultaneously are not supported well.  In such circumstances, a
192262306a36Sopenharmony_cisignificant portion of IOs are likely to be attributed incorrectly.
As the memory controller assigns page ownership on the first use and
192462306a36Sopenharmony_cidoesn't update it until the page is released, even if writeback
192562306a36Sopenharmony_cistrictly follows page ownership, multiple cgroups dirtying overlapping
192662306a36Sopenharmony_ciareas wouldn't work as expected.  It's recommended to avoid such usage
192762306a36Sopenharmony_cipatterns.
192862306a36Sopenharmony_ci
192962306a36Sopenharmony_ciThe sysctl knobs which affect writeback behavior are applied to cgroup
193062306a36Sopenharmony_ciwriteback as follows.
193162306a36Sopenharmony_ci
193262306a36Sopenharmony_ci  vm.dirty_background_ratio, vm.dirty_ratio
193362306a36Sopenharmony_ci	These ratios apply the same to cgroup writeback with the
193462306a36Sopenharmony_ci	amount of available memory capped by limits imposed by the
193562306a36Sopenharmony_ci	memory controller and system-wide clean memory.
193662306a36Sopenharmony_ci
193762306a36Sopenharmony_ci  vm.dirty_background_bytes, vm.dirty_bytes
193862306a36Sopenharmony_ci	For cgroup writeback, this is calculated into ratio against
193962306a36Sopenharmony_ci	total available memory and applied the same way as
194062306a36Sopenharmony_ci	vm.dirty[_background]_ratio.
194162306a36Sopenharmony_ci
194262306a36Sopenharmony_ci
194362306a36Sopenharmony_ciIO Latency
194462306a36Sopenharmony_ci~~~~~~~~~~
194562306a36Sopenharmony_ci
194662306a36Sopenharmony_ciThis is a cgroup v2 controller for IO workload protection.  You provide a group
194762306a36Sopenharmony_ciwith a latency target, and if the average latency exceeds that target the
controller will throttle any peers that have a higher latency target than the
194962306a36Sopenharmony_ciprotected workload.
195062306a36Sopenharmony_ci
195162306a36Sopenharmony_ciThe limits are only applied at the peer level in the hierarchy.  This means that
195262306a36Sopenharmony_ciin the diagram below, only groups A, B, and C will influence each other, and
195362306a36Sopenharmony_cigroups D and F will influence each other.  Group G will influence nobody::
195462306a36Sopenharmony_ci
195562306a36Sopenharmony_ci			[root]
195662306a36Sopenharmony_ci		/	   |		\
195762306a36Sopenharmony_ci		A	   B		C
195862306a36Sopenharmony_ci	       /  \        |
195962306a36Sopenharmony_ci	      D    F	   G
196062306a36Sopenharmony_ci
196162306a36Sopenharmony_ci
196262306a36Sopenharmony_ciSo the ideal way to configure this is to set io.latency in groups A, B, and C.
196362306a36Sopenharmony_ciGenerally you do not want to set a value lower than the latency your device
196462306a36Sopenharmony_cisupports.  Experiment to find the value that works best for your workload.
196562306a36Sopenharmony_ciStart at higher than the expected latency for your device and watch the
196662306a36Sopenharmony_ciavg_lat value in io.stat for your workload group to get an idea of the
196762306a36Sopenharmony_cilatency you see during normal operation.  Use the avg_lat value as a basis for
196862306a36Sopenharmony_ciyour real setting, setting at 10-15% higher than the value in io.stat.
196962306a36Sopenharmony_ci
197062306a36Sopenharmony_ciHow IO Latency Throttling Works
197162306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
197262306a36Sopenharmony_ci
197362306a36Sopenharmony_ciio.latency is work conserving; so as long as everybody is meeting their latency
197462306a36Sopenharmony_citarget the controller doesn't do anything.  Once a group starts missing its
197562306a36Sopenharmony_citarget it begins throttling any peer group that has a higher target than itself.
197662306a36Sopenharmony_ciThis throttling takes 2 forms:
197762306a36Sopenharmony_ci
- Queue depth throttling.  This is the number of outstanding IOs a group is
197962306a36Sopenharmony_ci  allowed to have.  We will clamp down relatively quickly, starting at no limit
198062306a36Sopenharmony_ci  and going all the way down to 1 IO at a time.
198162306a36Sopenharmony_ci
198262306a36Sopenharmony_ci- Artificial delay induction.  There are certain types of IO that cannot be
198362306a36Sopenharmony_ci  throttled without possibly adversely affecting higher priority groups.  This
198462306a36Sopenharmony_ci  includes swapping and metadata IO.  These types of IO are allowed to occur
  normally; however, they are "charged" to the originating group.  If the
  originating group is being throttled you will see the use_delay and delay
  fields in io.stat increase.  The delay value is the number of microseconds
  being added to any process that runs in this group.  Because this number can
  grow quite large if there is a lot of swapping or metadata IO occurring we
  limit the individual delay events to 1 second at a time.
199162306a36Sopenharmony_ci
199262306a36Sopenharmony_ciOnce the victimized group starts meeting its latency target again it will start
199362306a36Sopenharmony_ciunthrottling any peer groups that were throttled previously.  If the victimized
199462306a36Sopenharmony_cigroup simply stops doing IO the global counter will unthrottle appropriately.
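
The peer selection rule can be illustrated with a small Python model.
This is a sketch only, not the kernel's implementation: it reuses the
hierarchy from the diagram in the IO Latency section, the target values
are hypothetical, and peers without a configured target are simply
left out because this section does not describe how they are treated.

```python
# Hierarchy from the diagram in the IO Latency section: child -> parent.
PARENT = {"A": "root", "B": "root", "C": "root", "D": "A", "F": "A", "G": "B"}


def peers_to_throttle(group, targets):
    """Return the peers of `group` whose latency target is higher than its own.

    `targets` maps a group name to its io.latency target in microseconds.
    """
    own = targets[group]
    return sorted(
        name
        for name, parent in PARENT.items()
        if name != group
        and parent == PARENT[group]
        and name in targets
        and targets[name] > own
    )


# Hypothetical targets in microseconds.
targets = {"A": 10000, "B": 20000, "C": 5000, "D": 10000, "F": 50000}
```

With these targets, a miss by A affects only B, a miss by C affects
both A and B, and a miss by D affects only F, since limits apply only
among peers.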
199562306a36Sopenharmony_ci
199662306a36Sopenharmony_ciIO Latency Interface Files
199762306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~~
199862306a36Sopenharmony_ci
199962306a36Sopenharmony_ci  io.latency
	This takes a format similar to that of the other controllers.
200162306a36Sopenharmony_ci
200262306a36Sopenharmony_ci		"MAJOR:MINOR target=<target time in microseconds>"
200362306a36Sopenharmony_ci
200462306a36Sopenharmony_ci  io.stat
200562306a36Sopenharmony_ci	If the controller is enabled you will see extra stats in io.stat in
200662306a36Sopenharmony_ci	addition to the normal ones.
200762306a36Sopenharmony_ci
200862306a36Sopenharmony_ci	  depth
200962306a36Sopenharmony_ci		This is the current queue depth for the group.
201062306a36Sopenharmony_ci
201162306a36Sopenharmony_ci	  avg_lat
201262306a36Sopenharmony_ci		This is an exponential moving average with a decay rate of 1/exp
201362306a36Sopenharmony_ci		bound by the sampling interval.  The decay rate interval can be
201462306a36Sopenharmony_ci		calculated by multiplying the win value in io.stat by the
201562306a36Sopenharmony_ci		corresponding number of samples based on the win value.
201662306a36Sopenharmony_ci
201762306a36Sopenharmony_ci	  win
201862306a36Sopenharmony_ci		The sampling window size in milliseconds.  This is the minimum
201962306a36Sopenharmony_ci		duration of time between evaluation events.  Windows only elapse
202062306a36Sopenharmony_ci		with IO activity.  Idle periods extend the most recent window.
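
The tuning advice in the IO Latency section (start from the observed
avg_lat and add 10-15%) can be turned into a line for "io.latency"
with a few lines of Python.  This is a sketch under stated
assumptions: the helper names are hypothetical, and avg_lat is assumed
to be expressed in the same unit as the target, microseconds.

```python
def suggest_latency_target(avg_lat_us, headroom_pct=10):
    """Return avg_lat plus headroom_pct percent, rounded up to a whole us."""
    return (avg_lat_us * (100 + headroom_pct) + 99) // 100


def format_io_latency(device, target_us):
    """Build a "MAJOR:MINOR target=<us>" line for io.latency."""
    return "{} target={}".format(device, target_us)


# Hypothetical observation: avg_lat of 2000 for device 8:16.
line = format_io_latency("8:16", suggest_latency_target(2000, headroom_pct=15))
```

The resulting string would be written to the io.latency file of the
workload's cgroup.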
202162306a36Sopenharmony_ci
202262306a36Sopenharmony_ciIO Priority
202362306a36Sopenharmony_ci~~~~~~~~~~~
202462306a36Sopenharmony_ci
202562306a36Sopenharmony_ciA single attribute controls the behavior of the I/O priority cgroup policy,
namely the io.prio.class attribute. The following values are accepted for
202762306a36Sopenharmony_cithat attribute:
202862306a36Sopenharmony_ci
202962306a36Sopenharmony_ci  no-change
203062306a36Sopenharmony_ci	Do not modify the I/O priority class.
203162306a36Sopenharmony_ci
203262306a36Sopenharmony_ci  promote-to-rt
203362306a36Sopenharmony_ci	For requests that have a non-RT I/O priority class, change it into RT.
203462306a36Sopenharmony_ci	Also change the priority level of these requests to 4. Do not modify
203562306a36Sopenharmony_ci	the I/O priority of requests that have priority class RT.
203662306a36Sopenharmony_ci
203762306a36Sopenharmony_ci  restrict-to-be
203862306a36Sopenharmony_ci	For requests that do not have an I/O priority class or that have I/O
203962306a36Sopenharmony_ci	priority class RT, change it into BE. Also change the priority level
204062306a36Sopenharmony_ci	of these requests to 0. Do not modify the I/O priority class of
204162306a36Sopenharmony_ci	requests that have priority class IDLE.
204262306a36Sopenharmony_ci
204362306a36Sopenharmony_ci  idle
204462306a36Sopenharmony_ci	Change the I/O priority class of all requests into IDLE, the lowest
204562306a36Sopenharmony_ci	I/O priority class.
204662306a36Sopenharmony_ci
204762306a36Sopenharmony_ci  none-to-rt
204862306a36Sopenharmony_ci	Deprecated. Just an alias for promote-to-rt.
204962306a36Sopenharmony_ci
205062306a36Sopenharmony_ciThe following numerical values are associated with the I/O priority policies:
205162306a36Sopenharmony_ci
+----------------+---+
| no-change      | 0 |
+----------------+---+
| restrict-to-be | 2 |
+----------------+---+
| idle           | 3 |
+----------------+---+
205962306a36Sopenharmony_ci
206062306a36Sopenharmony_ciThe numerical value that corresponds to each I/O priority class is as follows:
206162306a36Sopenharmony_ci
206262306a36Sopenharmony_ci+-------------------------------+---+
206362306a36Sopenharmony_ci| IOPRIO_CLASS_NONE             | 0 |
206462306a36Sopenharmony_ci+-------------------------------+---+
206562306a36Sopenharmony_ci| IOPRIO_CLASS_RT (real-time)   | 1 |
206662306a36Sopenharmony_ci+-------------------------------+---+
206762306a36Sopenharmony_ci| IOPRIO_CLASS_BE (best effort) | 2 |
206862306a36Sopenharmony_ci+-------------------------------+---+
206962306a36Sopenharmony_ci| IOPRIO_CLASS_IDLE             | 3 |
207062306a36Sopenharmony_ci+-------------------------------+---+
207162306a36Sopenharmony_ci
207262306a36Sopenharmony_ciThe algorithm to set the I/O priority class for a request is as follows:
207362306a36Sopenharmony_ci
- If the I/O priority class policy is promote-to-rt and the request does not
  already have I/O priority class IOPRIO_CLASS_RT, change the request I/O
  priority class to IOPRIO_CLASS_RT and change the request I/O priority
  level to 4.
- If the I/O priority class policy is not promote-to-rt, translate the I/O
  priority class policy into a number, then change the request I/O priority
  class into the maximum of the I/O priority class policy number and the
  numerical I/O priority class of the request.
208162306a36Sopenharmony_ci
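The two rules above can be sketched in Python as follows.  This is a simplified model, not the kernel implementation: the function name is hypothetical, and resetting the priority level to 0 whenever the class is raised is an assumption extrapolated from the restrict-to-be description.

```python
# Numeric class values from the IOPRIO_CLASS_* table above.
IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = range(4)

# Numbers follow the policy table above, keyed by the accepted attribute values.
POLICY_NUMBER = {"no-change": 0, "restrict-to-be": 2, "idle": 3}

# "none-to-rt" is the deprecated alias for "promote-to-rt".
PROMOTE_POLICIES = {"promote-to-rt", "none-to-rt"}


def apply_policy(policy, io_class, level):
    """Return the (class, level) a request ends up with under `policy`."""
    if policy in PROMOTE_POLICIES:
        # Requests that are already RT are left untouched.
        if io_class != IOPRIO_CLASS_RT:
            return IOPRIO_CLASS_RT, 4
        return io_class, level
    new_class = max(POLICY_NUMBER[policy], io_class)
    if new_class != io_class:
        # Assumption: the level is reset to 0 whenever the class changes.
        return new_class, 0
    return io_class, level
```

Because a larger class number means a lower priority, taking the maximum can only demote a request, never promote it.
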
208262306a36Sopenharmony_ciPID
208362306a36Sopenharmony_ci---
208462306a36Sopenharmony_ci
208562306a36Sopenharmony_ciThe process number controller is used to allow a cgroup to stop any
208662306a36Sopenharmony_cinew tasks from being fork()'d or clone()'d after a specified limit is
208762306a36Sopenharmony_cireached.
208862306a36Sopenharmony_ci
208962306a36Sopenharmony_ciThe number of tasks in a cgroup can be exhausted in ways which other
209062306a36Sopenharmony_cicontrollers cannot prevent, thus warranting its own controller.  For
209162306a36Sopenharmony_ciexample, a fork bomb is likely to exhaust the number of tasks before
209262306a36Sopenharmony_cihitting memory restrictions.
209362306a36Sopenharmony_ci
Note that PIDs used in this controller refer to TIDs, that is, process IDs
as used by the kernel.
209662306a36Sopenharmony_ci
209762306a36Sopenharmony_ci
209862306a36Sopenharmony_ciPID Interface Files
209962306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~
210062306a36Sopenharmony_ci
210162306a36Sopenharmony_ci  pids.max
210262306a36Sopenharmony_ci	A read-write single value file which exists on non-root
210362306a36Sopenharmony_ci	cgroups.  The default is "max".
210462306a36Sopenharmony_ci
	Hard limit on the number of processes.
210662306a36Sopenharmony_ci
210762306a36Sopenharmony_ci  pids.current
210862306a36Sopenharmony_ci	A read-only single value file which exists on all cgroups.
210962306a36Sopenharmony_ci
211062306a36Sopenharmony_ci	The number of processes currently in the cgroup and its
211162306a36Sopenharmony_ci	descendants.
211262306a36Sopenharmony_ci
211362306a36Sopenharmony_ciOrganisational operations are not blocked by cgroup policies, so it is
211462306a36Sopenharmony_cipossible to have pids.current > pids.max.  This can be done by either
211562306a36Sopenharmony_cisetting the limit to be smaller than pids.current, or attaching enough
211662306a36Sopenharmony_ciprocesses to the cgroup such that pids.current is larger than
211762306a36Sopenharmony_cipids.max.  However, it is not possible to violate a cgroup PID policy
211862306a36Sopenharmony_cithrough fork() or clone(). These will return -EAGAIN if the creation
211962306a36Sopenharmony_ciof a new process would cause a cgroup policy to be violated.
212062306a36Sopenharmony_ci
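A monitoring tool might compare the two files as in the minimal Python sketch below.  The function names are hypothetical, and make_fake_cgroup() only exists so the sketch can be tried without a cgroup mount.  Only the limit of the given cgroup is checked; limits set on ancestors are not considered here.

```python
import os
import tempfile


def read_single_value(path):
    """Read a single value file and strip the trailing newline."""
    with open(path) as f:
        return f.read().strip()


def pids_headroom(cgroup_dir):
    """Return pids.max - pids.current, or None when pids.max is "max".

    The result can be negative, because organisational operations may
    push pids.current above pids.max.
    """
    limit = read_single_value(os.path.join(cgroup_dir, "pids.max"))
    current = int(read_single_value(os.path.join(cgroup_dir, "pids.current")))
    if limit == "max":
        return None
    return int(limit) - current


def make_fake_cgroup(current, limit):
    """Create a temporary directory that mimics the two interface files."""
    directory = tempfile.mkdtemp()
    with open(os.path.join(directory, "pids.current"), "w") as f:
        f.write(f"{current}\n")
    with open(os.path.join(directory, "pids.max"), "w") as f:
        f.write(f"{limit}\n")
    return directory
```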
212162306a36Sopenharmony_ci
212262306a36Sopenharmony_ciCpuset
212362306a36Sopenharmony_ci------
212462306a36Sopenharmony_ci
212562306a36Sopenharmony_ciThe "cpuset" controller provides a mechanism for constraining
212662306a36Sopenharmony_cithe CPU and memory node placement of tasks to only the resources
212762306a36Sopenharmony_cispecified in the cpuset interface files in a task's current cgroup.
212862306a36Sopenharmony_ciThis is especially valuable on large NUMA systems where placing jobs
212962306a36Sopenharmony_cion properly sized subsets of the systems with careful processor and
213062306a36Sopenharmony_cimemory placement to reduce cross-node memory access and contention
213162306a36Sopenharmony_cican improve overall system performance.
213262306a36Sopenharmony_ci
The "cpuset" controller is hierarchical.  That means a cgroup cannot
use CPUs or memory nodes that are not allowed in its parent.
213562306a36Sopenharmony_ci
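The hierarchical rule can be sketched in Python as below, using the list format of the cpuset interface files (comma-separated numbers or ranges).  This is a simplified model with hypothetical function names: it ignores partitions and hotplug, and it applies the rule that an empty or ungrantable request falls back to what the parent offers.

```python
def parse_list(text):
    """Parse a list such as "0-4,6,8-10" into a set of integers.

    An empty string yields an empty set.
    """
    result = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        result.update(range(int(first), int(last or first) + 1))
    return result


def effective(requested, parent_effective):
    """Model of the effective set granted to a child cgroup.

    The child gets the requested CPUs (or memory nodes) that the parent
    can offer.  If nothing was requested, or none of the requested items
    can be granted, the child is treated as if the request were empty
    and sees everything the parent offers.
    """
    granted = requested & parent_effective
    return granted if granted else set(parent_effective)
```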
213662306a36Sopenharmony_ci
213762306a36Sopenharmony_ciCpuset Interface Files
213862306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~
213962306a36Sopenharmony_ci
214062306a36Sopenharmony_ci  cpuset.cpus
214162306a36Sopenharmony_ci	A read-write multiple values file which exists on non-root
214262306a36Sopenharmony_ci	cpuset-enabled cgroups.
214362306a36Sopenharmony_ci
214462306a36Sopenharmony_ci	It lists the requested CPUs to be used by tasks within this
214562306a36Sopenharmony_ci	cgroup.  The actual list of CPUs to be granted, however, is
214662306a36Sopenharmony_ci	subjected to constraints imposed by its parent and can differ
214762306a36Sopenharmony_ci	from the requested CPUs.
214862306a36Sopenharmony_ci
214962306a36Sopenharmony_ci	The CPU numbers are comma-separated numbers or ranges.
215062306a36Sopenharmony_ci	For example::
215162306a36Sopenharmony_ci
215262306a36Sopenharmony_ci	  # cat cpuset.cpus
215362306a36Sopenharmony_ci	  0-4,6,8-10
215462306a36Sopenharmony_ci
215562306a36Sopenharmony_ci	An empty value indicates that the cgroup is using the same
215662306a36Sopenharmony_ci	setting as the nearest cgroup ancestor with a non-empty
215762306a36Sopenharmony_ci	"cpuset.cpus" or all the available CPUs if none is found.
215862306a36Sopenharmony_ci
215962306a36Sopenharmony_ci	The value of "cpuset.cpus" stays constant until the next update
216062306a36Sopenharmony_ci	and won't be affected by any CPU hotplug events.
216162306a36Sopenharmony_ci
216262306a36Sopenharmony_ci  cpuset.cpus.effective
216362306a36Sopenharmony_ci	A read-only multiple values file which exists on all
216462306a36Sopenharmony_ci	cpuset-enabled cgroups.
216562306a36Sopenharmony_ci
216662306a36Sopenharmony_ci	It lists the onlined CPUs that are actually granted to this
216762306a36Sopenharmony_ci	cgroup by its parent.  These CPUs are allowed to be used by
216862306a36Sopenharmony_ci	tasks within the current cgroup.
216962306a36Sopenharmony_ci
217062306a36Sopenharmony_ci	If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
217162306a36Sopenharmony_ci	all the CPUs from the parent cgroup that can be available to
217262306a36Sopenharmony_ci	be used by this cgroup.  Otherwise, it should be a subset of
217362306a36Sopenharmony_ci	"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
217462306a36Sopenharmony_ci	can be granted.  In this case, it will be treated just like an
217562306a36Sopenharmony_ci	empty "cpuset.cpus".
217662306a36Sopenharmony_ci
217762306a36Sopenharmony_ci	Its value will be affected by CPU hotplug events.
217862306a36Sopenharmony_ci
217962306a36Sopenharmony_ci  cpuset.mems
218062306a36Sopenharmony_ci	A read-write multiple values file which exists on non-root
218162306a36Sopenharmony_ci	cpuset-enabled cgroups.
218262306a36Sopenharmony_ci
218362306a36Sopenharmony_ci	It lists the requested memory nodes to be used by tasks within
218462306a36Sopenharmony_ci	this cgroup.  The actual list of memory nodes granted, however,
218562306a36Sopenharmony_ci	is subjected to constraints imposed by its parent and can differ
218662306a36Sopenharmony_ci	from the requested memory nodes.
218762306a36Sopenharmony_ci
218862306a36Sopenharmony_ci	The memory node numbers are comma-separated numbers or ranges.
218962306a36Sopenharmony_ci	For example::
219062306a36Sopenharmony_ci
219162306a36Sopenharmony_ci	  # cat cpuset.mems
219262306a36Sopenharmony_ci	  0-1,3
219362306a36Sopenharmony_ci
219462306a36Sopenharmony_ci	An empty value indicates that the cgroup is using the same
219562306a36Sopenharmony_ci	setting as the nearest cgroup ancestor with a non-empty
219662306a36Sopenharmony_ci	"cpuset.mems" or all the available memory nodes if none
219762306a36Sopenharmony_ci	is found.
219862306a36Sopenharmony_ci
219962306a36Sopenharmony_ci	The value of "cpuset.mems" stays constant until the next update
220062306a36Sopenharmony_ci	and won't be affected by any memory nodes hotplug events.
220162306a36Sopenharmony_ci
220262306a36Sopenharmony_ci	Setting a non-empty value to "cpuset.mems" causes memory of
220362306a36Sopenharmony_ci	tasks within the cgroup to be migrated to the designated nodes if
220462306a36Sopenharmony_ci	they are currently using memory outside of the designated nodes.
220562306a36Sopenharmony_ci
220662306a36Sopenharmony_ci	There is a cost for this memory migration.  The migration
220762306a36Sopenharmony_ci	may not be complete and some memory pages may be left behind.
220862306a36Sopenharmony_ci	So it is recommended that "cpuset.mems" should be set properly
220962306a36Sopenharmony_ci	before spawning new tasks into the cpuset.  Even if there is
221062306a36Sopenharmony_ci	a need to change "cpuset.mems" with active tasks, it shouldn't
221162306a36Sopenharmony_ci	be done frequently.
221262306a36Sopenharmony_ci
221362306a36Sopenharmony_ci  cpuset.mems.effective
221462306a36Sopenharmony_ci	A read-only multiple values file which exists on all
221562306a36Sopenharmony_ci	cpuset-enabled cgroups.
221662306a36Sopenharmony_ci
221762306a36Sopenharmony_ci	It lists the onlined memory nodes that are actually granted to
221862306a36Sopenharmony_ci	this cgroup by its parent. These memory nodes are allowed to
221962306a36Sopenharmony_ci	be used by tasks within the current cgroup.
222062306a36Sopenharmony_ci
222162306a36Sopenharmony_ci	If "cpuset.mems" is empty, it shows all the memory nodes from the
222262306a36Sopenharmony_ci	parent cgroup that will be available to be used by this cgroup.
222362306a36Sopenharmony_ci	Otherwise, it should be a subset of "cpuset.mems" unless none of
222462306a36Sopenharmony_ci	the memory nodes listed in "cpuset.mems" can be granted.  In this
222562306a36Sopenharmony_ci	case, it will be treated just like an empty "cpuset.mems".
222662306a36Sopenharmony_ci
222762306a36Sopenharmony_ci	Its value will be affected by memory nodes hotplug events.
222862306a36Sopenharmony_ci
222962306a36Sopenharmony_ci  cpuset.cpus.partition
223062306a36Sopenharmony_ci	A read-write single value file which exists on non-root
223162306a36Sopenharmony_ci	cpuset-enabled cgroups.  This flag is owned by the parent cgroup
223262306a36Sopenharmony_ci	and is not delegatable.
223362306a36Sopenharmony_ci
223462306a36Sopenharmony_ci	It accepts only the following input values when written to.
223562306a36Sopenharmony_ci
223662306a36Sopenharmony_ci	  ==========	=====================================
223762306a36Sopenharmony_ci	  "member"	Non-root member of a partition
223862306a36Sopenharmony_ci	  "root"	Partition root
223962306a36Sopenharmony_ci	  "isolated"	Partition root without load balancing
224062306a36Sopenharmony_ci	  ==========	=====================================
224162306a36Sopenharmony_ci
224262306a36Sopenharmony_ci	The root cgroup is always a partition root and its state
224362306a36Sopenharmony_ci	cannot be changed.  All other non-root cgroups start out as
224462306a36Sopenharmony_ci	"member".
224562306a36Sopenharmony_ci
224662306a36Sopenharmony_ci	When set to "root", the current cgroup is the root of a new
224762306a36Sopenharmony_ci	partition or scheduling domain that comprises itself and all
224862306a36Sopenharmony_ci	its descendants except those that are separate partition roots
224962306a36Sopenharmony_ci	themselves and their descendants.
225062306a36Sopenharmony_ci
225162306a36Sopenharmony_ci	When set to "isolated", the CPUs in that partition root will
225262306a36Sopenharmony_ci	be in an isolated state without any load balancing from the
225362306a36Sopenharmony_ci	scheduler.  Tasks placed in such a partition with multiple
225462306a36Sopenharmony_ci	CPUs should be carefully distributed and bound to each of the
225562306a36Sopenharmony_ci	individual CPUs for optimal performance.
225662306a36Sopenharmony_ci
225762306a36Sopenharmony_ci	The value shown in "cpuset.cpus.effective" of a partition root
225862306a36Sopenharmony_ci	is the CPUs that the partition root can dedicate to a potential
225962306a36Sopenharmony_ci	new child partition root. The new child subtracts available
226062306a36Sopenharmony_ci	CPUs from its parent "cpuset.cpus.effective".
226162306a36Sopenharmony_ci
226262306a36Sopenharmony_ci	A partition root ("root" or "isolated") can be in one of the
226362306a36Sopenharmony_ci	two possible states - valid or invalid.  An invalid partition
226462306a36Sopenharmony_ci	root is in a degraded state where some state information may
226562306a36Sopenharmony_ci	be retained, but behaves more like a "member".
226662306a36Sopenharmony_ci
226762306a36Sopenharmony_ci	All possible state transitions among "member", "root" and
226862306a36Sopenharmony_ci	"isolated" are allowed.
226962306a36Sopenharmony_ci
227062306a36Sopenharmony_ci	On read, the "cpuset.cpus.partition" file can show the following
227162306a36Sopenharmony_ci	values.
227262306a36Sopenharmony_ci
227362306a36Sopenharmony_ci	  =============================	=====================================
227462306a36Sopenharmony_ci	  "member"			Non-root member of a partition
227562306a36Sopenharmony_ci	  "root"			Partition root
227662306a36Sopenharmony_ci	  "isolated"			Partition root without load balancing
227762306a36Sopenharmony_ci	  "root invalid (<reason>)"	Invalid partition root
227862306a36Sopenharmony_ci	  "isolated invalid (<reason>)"	Invalid isolated partition root
227962306a36Sopenharmony_ci	  =============================	=====================================
228062306a36Sopenharmony_ci
228162306a36Sopenharmony_ci	In the case of an invalid partition root, a descriptive string on
228262306a36Sopenharmony_ci	why the partition is invalid is included within parentheses.
228362306a36Sopenharmony_ci
228462306a36Sopenharmony_ci	For a partition root to become valid, the following conditions
228562306a36Sopenharmony_ci	must be met.
228662306a36Sopenharmony_ci
	1) The "cpuset.cpus" is exclusive with its siblings, i.e. its CPUs
	   are not shared by any of its siblings (exclusivity rule).
228962306a36Sopenharmony_ci	2) The parent cgroup is a valid partition root.
229062306a36Sopenharmony_ci	3) The "cpuset.cpus" is not empty and must contain at least
229162306a36Sopenharmony_ci	   one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
229262306a36Sopenharmony_ci	4) The "cpuset.cpus.effective" cannot be empty unless there is
229362306a36Sopenharmony_ci	   no task associated with this partition.
229462306a36Sopenharmony_ci
229562306a36Sopenharmony_ci	External events like hotplug or changes to "cpuset.cpus" can
229662306a36Sopenharmony_ci	cause a valid partition root to become invalid and vice versa.
229762306a36Sopenharmony_ci	Note that a task cannot be moved to a cgroup with empty
229862306a36Sopenharmony_ci	"cpuset.cpus.effective".
229962306a36Sopenharmony_ci
	For a valid partition root with the sibling CPU exclusivity
	rule enabled, changes made to "cpuset.cpus" that violate the
	exclusivity rule will invalidate the partition as well as its
	sibling partitions with conflicting "cpuset.cpus" values.  So
	care must be taken when changing "cpuset.cpus".
230562306a36Sopenharmony_ci
230662306a36Sopenharmony_ci	A valid non-root parent partition may distribute out all its CPUs
230762306a36Sopenharmony_ci	to its child partitions when there is no task associated with it.
230862306a36Sopenharmony_ci
230962306a36Sopenharmony_ci	Care must be taken to change a valid partition root to
231062306a36Sopenharmony_ci	"member" as all its child partitions, if present, will become
231162306a36Sopenharmony_ci	invalid causing disruption to tasks running in those child
231262306a36Sopenharmony_ci	partitions. These inactivated partitions could be recovered if
231362306a36Sopenharmony_ci	their parent is switched back to a partition root with a proper
231462306a36Sopenharmony_ci	set of "cpuset.cpus".
231562306a36Sopenharmony_ci
	Poll and inotify events are triggered whenever the state of
	"cpuset.cpus.partition" changes.  That includes changes caused
	by writes to "cpuset.cpus.partition", CPU hotplug or other
	changes that modify the validity status of the partition.
	This allows user space agents to monitor unexpected changes
	to "cpuset.cpus.partition" without the need to do continuous
	polling.
232362306a36Sopenharmony_ci
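A user space agent that reads "cpuset.cpus.partition" has to tell the plain states apart from the invalid ones.  The Python sketch below parses the read values listed in the table above.  The function name is hypothetical and the reason string is returned verbatim, since its wording is not specified here.

```python
import re

# Matches "member", "root", "isolated", "root invalid (<reason>)" and
# "isolated invalid (<reason>)".
_STATE = re.compile(r"^(member|root|isolated)(?: invalid \((.*)\))?$")


def parse_partition_state(text):
    """Return (state, is_invalid, reason) for a value read from the file."""
    match = _STATE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected partition state: {text!r}")
    state, reason = match.groups()
    if state == "member" and reason is not None:
        # Only partition roots can be reported as invalid.
        raise ValueError(f"unexpected partition state: {text!r}")
    return state, reason is not None, reason
```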
232462306a36Sopenharmony_ci
232562306a36Sopenharmony_ciDevice controller
232662306a36Sopenharmony_ci-----------------
232762306a36Sopenharmony_ci
The device controller manages access to device files. This includes both
the creation of new device files (using mknod) and access to existing
device files.

The cgroup v2 device controller has no interface files and is implemented
on top of cgroup BPF. To control access to device files, a user may
create BPF programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach
them to cgroups with the BPF_CGROUP_DEVICE flag. On an attempt to access a
device file, the corresponding BPF programs will be executed, and depending
on the return value the attempt will succeed or fail with -EPERM.
233862306a36Sopenharmony_ci
233962306a36Sopenharmony_ciA BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the
234062306a36Sopenharmony_cibpf_cgroup_dev_ctx structure, which describes the device access attempt:
234162306a36Sopenharmony_ciaccess type (mknod/read/write) and device (type, major and minor numbers).
234262306a36Sopenharmony_ciIf the program returns 0, the attempt fails with -EPERM, otherwise it
234362306a36Sopenharmony_cisucceeds.
234462306a36Sopenharmony_ci
234562306a36Sopenharmony_ciAn example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in
234662306a36Sopenharmony_citools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.
234762306a36Sopenharmony_ci
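The decision logic of such a program can be modelled in plain Python as below.  This is only an illustration of the return value convention: a real program must be a BPF program, and the DeviceAccess and Rule classes are hypothetical stand-ins that do not mirror the layout of the bpf_cgroup_dev_ctx structure.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeviceAccess:
    """Hypothetical description of one access attempt."""
    access: str    # "mknod", "read" or "write"
    dev_type: str  # illustrative labels: "c" (char) or "b" (block)
    major: int
    minor: int


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule; None for major or minor matches any number."""
    dev_type: str
    major: Optional[int]
    minor: Optional[int]
    accesses: frozenset


def program(ctx, rules):
    """Return 1 to allow the attempt and 0 to deny it."""
    for rule in rules:
        if (rule.dev_type == ctx.dev_type
                and rule.major in (None, ctx.major)
                and rule.minor in (None, ctx.minor)
                and ctx.access in rule.accesses):
            return 1
    return 0


def attempt(ctx, rules):
    """Map the program's return value to the result seen by the caller."""
    return 0 if program(ctx, rules) != 0 else -errno.EPERM
```

For example, a rule set that contains only Rule("c", 1, 3, frozenset({"read", "write"})) would allow reads and writes on character device 1:3 and reject everything else, including mknod.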
234862306a36Sopenharmony_ci
234962306a36Sopenharmony_ciRDMA
235062306a36Sopenharmony_ci----
235162306a36Sopenharmony_ci
235262306a36Sopenharmony_ciThe "rdma" controller regulates the distribution and accounting of
235362306a36Sopenharmony_ciRDMA resources.
235462306a36Sopenharmony_ci
235562306a36Sopenharmony_ciRDMA Interface Files
235662306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~
235762306a36Sopenharmony_ci
235862306a36Sopenharmony_ci  rdma.max
	A read-write nested-keyed file that exists for all cgroups
	except root and describes the currently configured resource
	limit for an RDMA/IB device.

	Lines are keyed by device name and are not ordered.
	Each line contains a space-separated resource name and its
	configured limit that can be distributed.
236662306a36Sopenharmony_ci
236762306a36Sopenharmony_ci	The following nested keys are defined.
236862306a36Sopenharmony_ci
236962306a36Sopenharmony_ci	  ==========	=============================
237062306a36Sopenharmony_ci	  hca_handle	Maximum number of HCA Handles
	  hca_object	Maximum number of HCA Objects
237262306a36Sopenharmony_ci	  ==========	=============================
237362306a36Sopenharmony_ci
	An example for mlx4 and ocrdma devices follows::
237562306a36Sopenharmony_ci
237662306a36Sopenharmony_ci	  mlx4_0 hca_handle=2 hca_object=2000
237762306a36Sopenharmony_ci	  ocrdma1 hca_handle=3 hca_object=max
237862306a36Sopenharmony_ci
237962306a36Sopenharmony_ci  rdma.current
	A read-only file that describes current resource usage.
	It exists for all cgroups except root.

	An example for mlx4 and ocrdma devices follows::
238462306a36Sopenharmony_ci
238562306a36Sopenharmony_ci	  mlx4_0 hca_handle=1 hca_object=20
238662306a36Sopenharmony_ci	  ocrdma1 hca_handle=1 hca_object=23
238762306a36Sopenharmony_ci
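The nested-keyed layout of rdma.max and rdma.current can be read with a few lines of Python.  This is a minimal sketch: the function name is hypothetical, and representing "max" as None is a choice made for this example.

```python
def parse_nested_keyed(text):
    """Parse lines such as "mlx4_0 hca_handle=2 hca_object=2000".

    Returns {device: {resource: value}}; a value of "max" becomes None.
    """
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        device, pairs = fields[0], fields[1:]
        entry = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            entry[key] = None if value == "max" else int(value)
        devices[device] = entry
    return devices
```
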
238862306a36Sopenharmony_ciHugeTLB
238962306a36Sopenharmony_ci-------
239062306a36Sopenharmony_ci
The HugeTLB controller allows limiting HugeTLB usage per control group and
enforces the controller limit during page faults.
239362306a36Sopenharmony_ci
239462306a36Sopenharmony_ciHugeTLB Interface Files
239562306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~
239662306a36Sopenharmony_ci
239762306a36Sopenharmony_ci  hugetlb.<hugepagesize>.current
	Show current usage for "hugepagesize" hugetlb.  It exists for all
	cgroups except root.

  hugetlb.<hugepagesize>.max
	Set/show the hard limit of "hugepagesize" hugetlb usage.
	The default value is "max".  It exists for all cgroups except root.
240462306a36Sopenharmony_ci
240562306a36Sopenharmony_ci  hugetlb.<hugepagesize>.events
240662306a36Sopenharmony_ci	A read-only flat-keyed file which exists on non-root cgroups.
240762306a36Sopenharmony_ci
240862306a36Sopenharmony_ci	  max
		The number of allocation failures due to the HugeTLB limit.
241062306a36Sopenharmony_ci
241162306a36Sopenharmony_ci  hugetlb.<hugepagesize>.events.local
241262306a36Sopenharmony_ci	Similar to hugetlb.<hugepagesize>.events but the fields in the file
241362306a36Sopenharmony_ci	are local to the cgroup i.e. not hierarchical. The file modified event
241462306a36Sopenharmony_ci	generated on this file reflects only the local events.
241562306a36Sopenharmony_ci
241662306a36Sopenharmony_ci  hugetlb.<hugepagesize>.numa_stat
	Similar to memory.numa_stat, it shows the NUMA information of the
	hugetlb pages of <hugepagesize> in this cgroup.  Only hugetlb pages
	that are actively in use are included.  The per-node values are in bytes.
242062306a36Sopenharmony_ci
242162306a36Sopenharmony_ciMisc
242262306a36Sopenharmony_ci----
242362306a36Sopenharmony_ci
The Miscellaneous cgroup provides the resource limiting and tracking
mechanism for the scalar resources which cannot be abstracted like the other
cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC config
option.
242862306a36Sopenharmony_ci
A resource can be added to the controller via enum misc_res_type{} in the
include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[]
in the kernel/cgroup/misc.c file. The provider of the resource must set its
capacity prior to using the resource by calling misc_cg_set_capacity().

Once a capacity is set, the resource usage can be updated using the charge and
uncharge APIs. All of the APIs to interact with the misc controller are in
include/linux/misc_cgroup.h.
243762306a36Sopenharmony_ci
243862306a36Sopenharmony_ciMisc Interface Files
243962306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~
244062306a36Sopenharmony_ci
The miscellaneous controller provides the following interface files. The
examples assume that two misc resources (res_a and res_b) are registered:
244262306a36Sopenharmony_ci
244362306a36Sopenharmony_ci  misc.capacity
244462306a36Sopenharmony_ci        A read-only flat-keyed file shown only in the root cgroup.  It shows
244562306a36Sopenharmony_ci        miscellaneous scalar resources available on the platform along with
244662306a36Sopenharmony_ci        their quantities::
244762306a36Sopenharmony_ci
244862306a36Sopenharmony_ci	  $ cat misc.capacity
244962306a36Sopenharmony_ci	  res_a 50
245062306a36Sopenharmony_ci	  res_b 10
245162306a36Sopenharmony_ci
245262306a36Sopenharmony_ci  misc.current
	A read-only flat-keyed file shown in all cgroups.  It shows
	the current usage of the resources in the cgroup and its children::
245562306a36Sopenharmony_ci
245662306a36Sopenharmony_ci	  $ cat misc.current
245762306a36Sopenharmony_ci	  res_a 3
245862306a36Sopenharmony_ci	  res_b 0
245962306a36Sopenharmony_ci
246062306a36Sopenharmony_ci  misc.max
	A read-write flat-keyed file shown in the non-root cgroups.  It shows
	the allowed maximum usage of the resources in the cgroup and its
	children::
246362306a36Sopenharmony_ci
246462306a36Sopenharmony_ci	  $ cat misc.max
246562306a36Sopenharmony_ci	  res_a max
246662306a36Sopenharmony_ci	  res_b 4
246762306a36Sopenharmony_ci
246862306a36Sopenharmony_ci	Limit can be set by::
246962306a36Sopenharmony_ci
247062306a36Sopenharmony_ci	  # echo res_a 1 > misc.max
247162306a36Sopenharmony_ci
247262306a36Sopenharmony_ci	Limit can be set to max by::
247362306a36Sopenharmony_ci
247462306a36Sopenharmony_ci	  # echo res_a max > misc.max
247562306a36Sopenharmony_ci
247662306a36Sopenharmony_ci        Limits can be set higher than the capacity value in the misc.capacity
247762306a36Sopenharmony_ci        file.
247862306a36Sopenharmony_ci
247962306a36Sopenharmony_ci  misc.events
248062306a36Sopenharmony_ci	A read-only flat-keyed file which exists on non-root cgroups. The
248162306a36Sopenharmony_ci	following entries are defined. Unless specified otherwise, a value
248262306a36Sopenharmony_ci	change in this file generates a file modified event. All fields in
248362306a36Sopenharmony_ci	this file are hierarchical.
248462306a36Sopenharmony_ci
248562306a36Sopenharmony_ci	  max
248662306a36Sopenharmony_ci		The number of times the cgroup's resource usage was
248762306a36Sopenharmony_ci		about to go over the max boundary.
248862306a36Sopenharmony_ci
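The flat-keyed misc files shown above are straightforward to handle from a script.  The Python sketch below parses such a file and builds the string that is written to misc.max.  The function names are hypothetical, and mapping "max" to None is a choice made for this example.

```python
def parse_flat_keyed(text):
    """Parse "key value" lines into a dict; the value "max" becomes None."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        values[key] = None if value == "max" else int(value)
    return values


def format_limit(resource, limit=None):
    """Build the text written to misc.max; None requests "max"."""
    return f"{resource} {'max' if limit is None else limit}"
```

With the example contents above, parse_flat_keyed("res_a max\nres_b 4\n") yields {"res_a": None, "res_b": 4}, and format_limit("res_a", 1) yields the string used in the echo example.
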
248962306a36Sopenharmony_ciMigration and Ownership
249062306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~
249162306a36Sopenharmony_ci
249262306a36Sopenharmony_ciA miscellaneous scalar resource is charged to the cgroup in which it is used
249362306a36Sopenharmony_cifirst, and stays charged to that cgroup until that resource is freed. Migrating
249462306a36Sopenharmony_cia process to a different cgroup does not move the charge to the destination
249562306a36Sopenharmony_cicgroup where the process has moved.
249662306a36Sopenharmony_ci
249762306a36Sopenharmony_ciOthers
249862306a36Sopenharmony_ci------
249962306a36Sopenharmony_ci
250062306a36Sopenharmony_ciperf_event
250162306a36Sopenharmony_ci~~~~~~~~~~
250262306a36Sopenharmony_ci
The perf_event controller, if not mounted on a legacy hierarchy, is
automatically enabled on the v2 hierarchy so that perf events can
always be filtered by cgroup v2 path.  The controller can still be
moved to a legacy hierarchy after the v2 hierarchy is populated.
250762306a36Sopenharmony_ci
250862306a36Sopenharmony_ci
250962306a36Sopenharmony_ciNon-normative information
251062306a36Sopenharmony_ci-------------------------
251162306a36Sopenharmony_ci
251262306a36Sopenharmony_ciThis section contains information that isn't considered to be a part of
251362306a36Sopenharmony_cithe stable kernel API and so is subject to change.
251462306a36Sopenharmony_ci
251562306a36Sopenharmony_ci
251662306a36Sopenharmony_ciCPU controller root cgroup process behaviour
251762306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
251862306a36Sopenharmony_ci
When distributing CPU cycles in the root cgroup, each thread in this
cgroup is treated as if it was hosted in a separate child cgroup of the
root cgroup. The weight of this child cgroup depends on the nice level
of its thread.

For details of this mapping see the sched_prio_to_weight array in the
kernel/sched/core.c file (values from this array should be scaled
appropriately so that the neutral value, nice 0, is 100 instead of 1024).
252762306a36Sopenharmony_ci
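The scaling can be illustrated with a short Python sketch.  The table values are copied here for illustration and should be checked against the sched_prio_to_weight array in the kernel tree in use.  The function name is hypothetical, and plain linear scaling by 100/1024 is an assumption about what "scaled appropriately" means.

```python
# Weights for nice levels -20 .. 19, one row per five nice levels.
SCHED_PRIO_TO_WEIGHT = [
    88761, 71755, 56483, 46273, 36291,  # -20 .. -16
    29154, 23254, 18705, 14949, 11916,  # -15 .. -11
    9548, 7620, 6100, 4904, 3906,       # -10 .. -6
    3121, 2501, 1991, 1586, 1277,       # -5 .. -1
    1024, 820, 655, 526, 423,           # 0 .. 4
    335, 272, 215, 172, 137,            # 5 .. 9
    110, 87, 70, 56, 45,                # 10 .. 14
    36, 29, 23, 18, 15,                 # 15 .. 19
]


def root_thread_weight(nice):
    """Approximate cgroup weight of a root cgroup thread at `nice`."""
    if not -20 <= nice <= 19:
        raise ValueError("nice level must be between -20 and 19")
    # Assumption: linear scaling so that nice 0 (1024) maps to 100.
    return SCHED_PRIO_TO_WEIGHT[nice + 20] * 100 / 1024
```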
252862306a36Sopenharmony_ci
252962306a36Sopenharmony_ciIO controller root cgroup process behaviour
253062306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
253162306a36Sopenharmony_ci
253262306a36Sopenharmony_ciRoot cgroup processes are hosted in an implicit leaf child node.
253362306a36Sopenharmony_ciWhen distributing IO resources this implicit child node is taken into
253462306a36Sopenharmony_ciaccount as if it was a normal child cgroup of the root cgroup with a
253562306a36Sopenharmony_ciweight value of 200.
253662306a36Sopenharmony_ci
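The effect of the implicit node can be sketched as below.  This is a simplified Python model with hypothetical names, assuming that weights translate into proportional shares and that the implicit node and every child are active at the same time.

```python
ROOT_IMPLICIT_WEIGHT = 200  # weight of the implicit leaf child node
ROOT_PROCESSES = "<root processes>"  # hypothetical label for that node


def root_io_shares(child_weights):
    """Return the proportional share of each top-level node.

    `child_weights` maps child cgroup names to their weights.  The
    implicit node hosting root cgroup processes is added with weight 200.
    """
    weights = dict(child_weights)
    weights[ROOT_PROCESSES] = ROOT_IMPLICIT_WEIGHT
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

Under these assumptions, two child cgroups of weight 100 each would get a quarter of the IO resources each, and root cgroup processes would get half.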
253762306a36Sopenharmony_ci
253862306a36Sopenharmony_ciNamespace
253962306a36Sopenharmony_ci=========
254062306a36Sopenharmony_ci
254162306a36Sopenharmony_ciBasics
254262306a36Sopenharmony_ci------
254362306a36Sopenharmony_ci
254462306a36Sopenharmony_cicgroup namespace provides a mechanism to virtualize the view of the
254562306a36Sopenharmony_ci"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
254662306a36Sopenharmony_ciflag can be used with clone(2) and unshare(2) to create a new cgroup
254762306a36Sopenharmony_cinamespace.  The process running inside the cgroup namespace will have
254862306a36Sopenharmony_ciits "/proc/$PID/cgroup" output restricted to cgroupns root.  The
254962306a36Sopenharmony_cicgroupns root is the cgroup of the process at the time of creation of
255062306a36Sopenharmony_cithe cgroup namespace.
255162306a36Sopenharmony_ci
255262306a36Sopenharmony_ciWithout cgroup namespace, the "/proc/$PID/cgroup" file shows the
255362306a36Sopenharmony_cicomplete path of the cgroup of a process.  In a container setup where
255462306a36Sopenharmony_cia set of cgroups and namespaces are intended to isolate processes the
255562306a36Sopenharmony_ci"/proc/$PID/cgroup" file may leak potential system level information
255662306a36Sopenharmony_cito the isolated processes.  For example::
255762306a36Sopenharmony_ci
255862306a36Sopenharmony_ci  # cat /proc/self/cgroup
255962306a36Sopenharmony_ci  0::/batchjobs/container_id1
256062306a36Sopenharmony_ci
256162306a36Sopenharmony_ciThe path '/batchjobs/container_id1' can be considered as system-data
256262306a36Sopenharmony_ciand undesirable to expose to the isolated processes.  cgroup namespace
256362306a36Sopenharmony_cican be used to restrict visibility of this path.  For example, before
256462306a36Sopenharmony_cicreating a cgroup namespace, one would see::
256562306a36Sopenharmony_ci
256662306a36Sopenharmony_ci  # ls -l /proc/self/ns/cgroup
256762306a36Sopenharmony_ci  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
256862306a36Sopenharmony_ci  # cat /proc/self/cgroup
256962306a36Sopenharmony_ci  0::/batchjobs/container_id1
257062306a36Sopenharmony_ci
257162306a36Sopenharmony_ciAfter unsharing a new namespace, the view changes::
257262306a36Sopenharmony_ci
257362306a36Sopenharmony_ci  # ls -l /proc/self/ns/cgroup
257462306a36Sopenharmony_ci  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
257562306a36Sopenharmony_ci  # cat /proc/self/cgroup
257662306a36Sopenharmony_ci  0::/
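
The namespace link shown in the listings above can also be inspected
programmatically.  The following Python sketch parses the link target
and compares the cgroup namespaces of two processes by inode number.
The helper names are hypothetical, and the code assumes a Linux /proc
layout as shown in the examples.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'cgroup:[4026531835]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % (target,))
    return match.group(1), int(match.group(2))


def cgroup_ns_id(pid="self"):
    """Return the inode number identifying the cgroup namespace of pid."""
    kind, inode = parse_ns_link(os.readlink("/proc/%s/ns/cgroup" % pid))
    if kind != "cgroup":
        raise ValueError("not a cgroup namespace link: %r" % (kind,))
    return inode


def same_cgroup_ns(pid_a, pid_b):
    """Return True if both processes share one cgroup namespace."""
    return cgroup_ns_id(pid_a) == cgroup_ns_id(pid_b)
```

Two processes whose links carry different inode numbers, as in the
listings before and after unsharing, are in different cgroup
namespaces.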

When some thread from a multi-threaded process unshares its cgroup
namespace, the new cgroupns gets applied to the entire process (all
the threads).  This is natural for the v2 hierarchy; however, for the
legacy hierarchies, this may be unexpected.

A cgroup namespace is alive as long as there are processes inside or
mounts pinning it.  When the last usage goes away, the cgroup
namespace is destroyed.  The cgroupns root and the actual cgroups
remain.


The Root and Views
------------------

The 'cgroupns root' for a cgroup namespace is the cgroup in which the
process calling unshare(2) is running.  For example, if a process in
/batchjobs/container_id1 cgroup calls unshare, cgroup
/batchjobs/container_id1 becomes the cgroupns root.  For the
init_cgroup_ns, this is the real root ('/') cgroup.

The cgroupns root cgroup does not change even if the namespace creator
process later moves to a different cgroup::

  # ~/unshare -c # unshare cgroupns in some cgroup
  # cat /proc/self/cgroup
  0::/
  # mkdir sub_cgrp_1
  # echo 0 > sub_cgrp_1/cgroup.procs
  # cat /proc/self/cgroup
  0::/sub_cgrp_1

Each process gets its namespace-specific view of "/proc/$PID/cgroup".

Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
From within an unshared cgroupns::

  # sleep 100000 &
  [1] 7353
  # echo 7353 > sub_cgrp_1/cgroup.procs
  # cat /proc/7353/cgroup
  0::/sub_cgrp_1

From the initial cgroup namespace, the real cgroup path will be
visible::

  $ cat /proc/7353/cgroup
  0::/batchjobs/container_id1/sub_cgrp_1

From a sibling cgroup namespace (that is, a namespace rooted at a
different cgroup), the cgroup path is shown relative to the reader's
own cgroup namespace root.  For instance, if PID 7353 is in sub_cgrp_1
under a namespace rooted at '/batchjobs/container_id2', a reader whose
namespace root is '/batchjobs/container_id1' will see::

  # cat /proc/7353/cgroup
  0::/../container_id2/sub_cgrp_1

Note that the relative path always starts with '/' to indicate that
it is relative to the cgroup namespace root of the caller.
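
The three views above follow one rule, which can be modelled in a few
lines of Python.  This is only a model of the displayed path, not the
kernel's implementation; the function name is hypothetical and both
arguments are assumed to be absolute paths as seen from the initial
cgroup namespace.

```python
import posixpath


def cgroup_path_as_seen_from(ns_root, cgroup):
    """Return the path of cgroup as shown to a reader rooted at ns_root.

    Both arguments are absolute cgroup paths in the initial namespace.
    The result always starts with '/', mirroring /proc/$PID/cgroup.
    """
    relative = posixpath.relpath(cgroup, start=ns_root)
    if relative == ".":
        # The cgroup is the reader's cgroupns root itself.
        return "/"
    return "/" + relative
```

For example, a cgroup below the reader's root yields a plain path such
as '/sub_cgrp_1', while a cgroup below a sibling root yields a path
that climbs out with '..' first.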


Migration and setns(2)
----------------------

Processes inside a cgroup namespace can move into and out of the
namespace root if they have proper access to external cgroups.  For
example, from inside a namespace with cgroupns root at
/batchjobs/container_id1, and assuming that the global hierarchy is
still accessible inside cgroupns::

  # cat /proc/7353/cgroup
  0::/sub_cgrp_1
  # echo 7353 > batchjobs/container_id2/cgroup.procs
  # cat /proc/7353/cgroup
  0::/../container_id2

Note that this kind of setup is not encouraged.  A task inside cgroup
namespace should only be exposed to its own cgroupns hierarchy.

setns(2) to another cgroup namespace is allowed when both of the
following hold:

(a) the process has CAP_SYS_ADMIN against its current user namespace
(b) the process has CAP_SYS_ADMIN against the target cgroup
    namespace's userns

No implicit cgroup changes happen with attaching to another cgroup
namespace.  It is expected that someone moves the attaching
process under the target cgroup namespace root.
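
The two steps, attaching to the namespace and then moving the process,
might be sketched from userspace as follows.  This Python sketch calls
setns(2) through ctypes and is an illustration only: it assumes a
Linux C library that exposes setns(), the required capabilities, and
a cgroup2 mount that the caller can write to.  The function names and
the cgroup directory argument are hypothetical.

```python
import ctypes
import os

# Namespace type flag for cgroup namespaces, from <linux/sched.h>.
CLONE_NEWCGROUP = 0x02000000


def join_cgroup_ns(pid):
    """Attach the caller to the cgroup namespace of pid using setns(2)."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open("/proc/%d/ns/cgroup" % pid, os.O_RDONLY)
    try:
        if libc.setns(fd, CLONE_NEWCGROUP) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)


def move_to_cgroup(cgroup_dir, pid):
    """Migrate pid by writing it to cgroup.procs in cgroup_dir.

    setns(2) does not do this implicitly, so it is a separate step.
    """
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as procs:
        procs.write(str(pid))
```

A caller lacking CAP_SYS_ADMIN in either user namespace should expect
join_cgroup_ns() to raise an OSError.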


Interaction with Other Namespaces
---------------------------------

A namespace-specific cgroup hierarchy can be mounted by a process
running inside a non-init cgroup namespace::

  # mount -t cgroup2 none $MOUNT_POINT

This will mount the unified cgroup hierarchy with cgroupns root as the
filesystem root.  The process needs CAP_SYS_ADMIN against its user and
mount namespaces.

The virtualization of the /proc/self/cgroup file, combined with
restricting the view of the cgroup hierarchy through a
namespace-private cgroupfs mount, provides a properly isolated cgroup
view inside the container.


Information on Kernel Programming
=================================

This section contains kernel programming information in the areas
where interacting with cgroup is necessary.  cgroup core and
controllers are not covered.


Filesystem Support for Writeback
--------------------------------

A filesystem can support cgroup writeback by updating
address_space_operations->writepage[s]() to annotate bios using the
following two functions.

  wbc_init_bio(@wbc, @bio)
	Should be called for each bio carrying writeback data and
	associates the bio with the inode's owner cgroup and the
	corresponding request queue.  This must be called after
	a queue (device) has been associated with the bio and
	before submission.

  wbc_account_cgroup_owner(@wbc, @page, @bytes)
	Should be called for each data segment being written out.
	While this function doesn't care exactly when it's called
	during the writeback session, it is easiest and most
	natural to call it as data segments are added to a bio.

With writeback bios annotated, cgroup support can be enabled per
super_block by setting SB_I_CGROUPWB in ->s_iflags.  This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.

wbc_init_bio() binds the specified bio to its cgroup.  Depending on
the configuration, the bio may be executed at a lower priority and,
if the writeback session is holding shared resources, e.g. a journal
entry, this may lead to priority inversion.  There is no single easy
solution for the problem.  Filesystems can try to work around specific
problem cases by skipping wbc_init_bio() and using
bio_associate_blkg() directly.


Deprecated v1 Core Features
===========================

- Multiple hierarchies including named ones are not supported.

- None of the v1 mount options are supported.

- The "tasks" file is removed and "cgroup.procs" is not sorted.

- "cgroup.clone_children" is removed.

- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
  at the root instead.


Issues with v1 and Rationales for v2
====================================

Multiple Hierarchies
--------------------

cgroup v1 allowed an arbitrary number of hierarchies and each
hierarchy could host any number of controllers.  While this seemed to
provide a high level of flexibility, it wasn't useful in practice.

For example, as there is only one instance of each controller, utility
type controllers such as freezer, which can be useful in all
hierarchies, could only be used in one.  The issue is exacerbated by
the fact that controllers couldn't be moved to another hierarchy once
hierarchies were populated.  Another issue was that all controllers
bound to a hierarchy were forced to have exactly the same view of the
hierarchy.  It wasn't possible to vary the granularity depending on
the specific controller.

In practice, these issues heavily limited which controllers could be
put on the same hierarchy and most configurations resorted to putting
each controller on its own hierarchy.  Only closely related ones, such
as the cpu and cpuacct controllers, made sense to be put on the same
hierarchy.  This often meant that userland ended up managing multiple
similar hierarchies, repeating the same steps on each hierarchy
whenever a hierarchy management operation was necessary.

Furthermore, support for multiple hierarchies came at a steep cost.
It greatly complicated cgroup core implementation but, more
importantly, the support for multiple hierarchies restricted how
cgroup could be used in general and what controllers were able to do.

There was no limit on how many hierarchies there might be, which meant
that a thread's cgroup membership couldn't be described in finite
length.  The key might contain any number of entries and was unlimited
in length, which made it highly awkward to manipulate and led to the
addition of controllers which existed only to identify membership,
which in turn exacerbated the original problem of a proliferating
number of hierarchies.

Also, as a controller couldn't have any expectation regarding the
topologies of hierarchies other controllers might be on, each
controller had to assume that all other controllers were attached to
completely orthogonal hierarchies.  This made it impossible, or at
least very cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary.  What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller.  In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers.  For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.


Thread Granularity
------------------

cgroup v1 allowed threads of a process to belong to different cgroups.
This didn't make sense for some controllers and those controllers
ended up implementing different ways to ignore such situations but,
much more importantly, it blurred the line between the API exposed to
individual applications and the system management interface.

Generally, in-process knowledge is available only to the process
itself; thus, unlike service-level organization of processes,
categorizing threads of a process requires active participation from
the application which owns the target process.

cgroup v1 had an ambiguously defined delegation model which got abused
in combination with thread granularity.  cgroups were delegated to
individual applications so that they could create and manage their own
sub-hierarchies and control resource distributions along them.  This
effectively raised cgroup to the status of a syscall-like API exposed
to lay programs.

First of all, cgroup has a fundamentally inadequate interface to be
exposed this way.  For a process to access its own knobs, it has to
extract the path on the target hierarchy from /proc/self/cgroup,
construct the path by appending the name of the knob to the path, open
and then read and/or write to it.  This is not only extremely clunky
and unusual but also inherently racy.  There is no conventional way to
define a transaction across the required steps and nothing can
guarantee that the process would actually be operating on its own
sub-hierarchy.
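
The sequence of steps can be sketched in Python to make the race
visible.  The sketch parses the single-entry "0::/path" format used in
the examples of this document rather than a multi-hierarchy v1 file,
and it assumes the hierarchy is mounted at /sys/fs/cgroup; the mount
point and all function names are hypothetical.

```python
import posixpath


def v2_cgroup_path(proc_cgroup_text):
    """Extract the cgroup path from the '0::/path' entry of the text."""
    for line in proc_cgroup_text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    raise LookupError("no '0::' entry found")


def knob_path(proc_cgroup_text, knob, mount_point="/sys/fs/cgroup"):
    """Build the file path of knob for the cgroup named in the text."""
    relative = v2_cgroup_path(proc_cgroup_text).lstrip("/")
    return posixpath.join(mount_point, relative, knob)


def read_own_knob(knob):
    """Read one of the caller's own knobs; inherently racy."""
    # Step 1: find out which cgroup the process is in right now.
    with open("/proc/self/cgroup") as source:
        text = source.read()
    # Step 2: construct the path.  The process may be migrated at any
    # point from here on, so the path may already be stale.
    path = knob_path(text, knob)
    # Step 3: open and read.  Nothing ties this to steps 1 and 2.
    with open(path) as knob_file:
        return knob_file.read()
```

If the process is moved to another cgroup between the first and third
steps, read_own_knob() silently operates on a cgroup the process no
longer belongs to.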

cgroup controllers implemented a number of knobs which would never be
accepted as public APIs because they were just adding control knobs to
a system-management pseudo filesystem.  cgroup ended up with interface
knobs which were not properly abstracted or refined and directly
revealed kernel internal details.  These knobs got exposed to
individual applications through the ill-defined delegation mechanism,
effectively abusing cgroup as a shortcut to implementing public APIs
without going through the required scrutiny.

This was painful for both userland and kernel.  Userland ended up with
misbehaving and poorly abstracted interfaces, and the kernel ended up
inadvertently exposing constructs and being locked into them.


Competition Between Inner Nodes and Threads
-------------------------------------------

cgroup v1 allowed threads to be in any cgroups, which created an
interesting problem where threads belonging to a parent cgroup and its
child cgroups competed for resources.  This was nasty as two
different types of entities competed and there was no obvious way to
settle it.  Different controllers did different things.

The cpu controller considered threads and cgroups as equivalents and
mapped nice levels to cgroup weights.  This worked for some cases but
fell flat when children wanted to be allocated specific ratios of CPU
cycles and the number of internal threads fluctuated - the ratios
constantly changed as the number of competing entities fluctuated.
There also were other issues.  The mapping from nice level to weight
wasn't obvious or universal, and there were various other knobs which
simply weren't available for threads.
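
The fluctuation can be shown with a small calculation.  The Python
sketch below models the v1 cpu controller behavior described above
under simplifying assumptions: every entity is continuously runnable,
all internal threads run at nice 0, and a nice 0 thread carries the
same weight (1024) as a child cgroup with that many shares.  The
function name is hypothetical.

```python
# Weight of a nice 0 thread in the scheduler's weight table.
NICE_0_WEIGHT = 1024


def child_cpu_share(child_shares, sibling_shares, internal_threads):
    """Return the CPU fraction one child cgroup receives in the parent.

    child_shares is the weight of the child of interest, sibling_shares
    lists the weights of the other child cgroups, and internal_threads
    is the number of nice 0 threads living directly in the parent.
    """
    total = (child_shares
             + sum(sibling_shares)
             + internal_threads * NICE_0_WEIGHT)
    return child_shares / total
```

With two children of equal weight, each gets half of the parent's CPU
as long as the parent has no threads of its own.  Two internal threads
cut that to a quarter and six cut it to an eighth, even though no
configuration has changed.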

The io controller implicitly created a hidden leaf node for each
cgroup to host the threads.  The hidden leaf had its own copies of all
the knobs with ``leaf_`` prefixed.  While this allowed equivalent
control over internal threads, it came with serious drawbacks.  It
always added an extra layer of nesting which wouldn't be necessary
otherwise, made the interface messy and significantly complicated the
implementation.

The memory controller didn't have a way to control what happened
between internal tasks and child cgroups and the behavior was not
clearly defined.  There were attempts to add ad-hoc behaviors and
knobs to tailor the behavior to specific workloads which would have
led to problems extremely difficult to resolve in the long term.

Multiple controllers struggled with internal tasks and came up with
different ways to deal with them; unfortunately, all the approaches
were severely flawed and, furthermore, the widely different behaviors
made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core
in a uniform way.


Other Interface Issues
----------------------

cgroup v1 grew without oversight and developed a large number of
idiosyncrasies and inconsistencies.  One issue on the cgroup core side
was how an empty cgroup was notified - a userland helper binary was
forked and executed for each event.  The event delivery wasn't
recursive or delegatable.  The limitations of the mechanism also led
to an in-kernel event delivery filtering mechanism, further
complicating the interface.

Controller interfaces were problematic too.  An extreme example is
controllers completely ignoring hierarchical organization and treating
all cgroups as if they were all located directly under the root
cgroup.  Some controllers exposed a large amount of inconsistent
implementation details to userland.

There also was no consistency across controllers.  When a new cgroup
was created, some controllers defaulted to not imposing extra
restrictions while others disallowed any resource usage until
explicitly configured.  Configuration knobs for the same type of
control used widely differing naming schemes and formats.  Statistics
and information knobs were named arbitrarily and used different
formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates
controllers so that they expose minimal and consistent interfaces.


Controller Issues and Remedies
------------------------------

Memory
~~~~~~

The original lower boundary, the soft limit, is defined as a limit
that is unset by default.  As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out.  The costs for
optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide the
basic desirable behavior.  First off, the soft limit has no
hierarchical meaning.  All configured groups are organized in a global
rbtree and treated like equal peers, regardless of where they are
located in the hierarchy.  This makes subtree delegation impossible.
Second, the soft limit reclaim pass is so aggressive that it not only
introduces high allocation latencies into the system, but also impacts
system performance due to overreclaim, to the point where the feature
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
reserve.  A cgroup enjoys reclaim protection when it's within its
effective low, which makes delegation of subtrees possible.  It also
enjoys having reclaim pressure proportional to its overage when
above its effective low.

The original high boundary, the hard limit, is defined as a strict
limit that cannot budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of the
available memory.  The memory consumption of workloads varies during
runtime, and that requires users to overcommit.  But doing that with a
strict upper limit requires either a fairly accurate prediction of the
working set size or adding slack to the limit.  Since working set size
estimation is hard and error prone, and getting it wrong results in
OOM kills, most users tend to err on the side of a looser limit and
end up wasting precious resources.

The memory.high boundary on the other hand can be set much more
conservatively.  When hit, it throttles allocations by forcing them
into direct reclaim to work off the excess, but it never invokes the
OOM killer.  As a result, a high boundary that is chosen too
aggressively will not terminate the processes, but instead it will
lead to gradual performance degradation.  The user can monitor this
and make corrections until the minimal memory footprint that still
gives acceptable performance is found.
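
One way to do such monitoring is to sample the "high" counter of the
memory.events file, which counts how often the cgroup was throttled
for exceeding the high boundary.  The Python sketch below parses two
samples of that flat-keyed file and reports the difference.  It works
on file contents passed in as strings; how and how often the file is
sampled is left open, and the function names are hypothetical.

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed file such as memory.events into a dict of ints."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        values[key] = int(value)
    return values


def high_events_delta(before_text, after_text):
    """Return how many times the high boundary was hit between samples."""
    before = parse_flat_keyed(before_text)
    after = parse_flat_keyed(after_text)
    return after["high"] - before["high"]
```

A delta that keeps growing while the workload slows down suggests that
memory.high was chosen too low for the current working set.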

In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary can
be exceeded.  But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of the
system than killing the group.  Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.

Setting the original memory.limit_in_bytes below the current usage was
subject to a race condition, where concurrent charges could cause the
limit setting to fail.  memory.max on the other hand will first set the
limit to prevent new charges, and then reclaim and OOM kill until the
new limit is met - or the task writing to memory.max is killed.

The combined memory+swap accounting and limiting is replaced by real
control over swap space.

The main argument for a combined memory+swap facility in the original
cgroup design was that global or parental pressure would always be
able to swap all anonymous memory of a child group, regardless of the
child's own (possibly untrusted) configuration.  However, untrusted
groups can sabotage swapping by other means - such as referencing their
anonymous memory in a tight loop - and an admin cannot assume full
swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an
intuitive userspace interface, and it flies in the face of the idea
that cgroup controllers should account and limit specific physical
resources.  Swap space is a resource like all others in the system,
and that's why unified hierarchy allows distributing it separately.