==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
===========================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``),
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
        this policy is "hard coded" into the kernel. It is the policy
        that governs all page allocations that aren't controlled by
        one of the more specific policy scopes discussed below. When
        the system is "up and running", the system default policy will
        use "local allocation" described below. However, during boot
        up, the system default policy will be set to interleave
        allocations across all nodes with "sufficient" memory, so as
        not to overload the initial boot node with boot-time
        allocations.

Task/Process Policy
        this is an optional, per-task policy. When defined for a
        specific task, this policy controls all page allocations made
        by or on behalf of the task that aren't controlled by a more
        specific scope. If a task does not define a task policy, then
        all page allocations that would have been controlled by the
        task policy "fall back" to the System Default Policy.

        The task policy applies to the entire address space of a task.
        Thus, it is inheritable, and indeed is inherited, across both
        fork() [clone() w/o the CLONE_VM flag] and exec*(). This allows
        a parent task to establish the task policy for a child task
        exec()'d from an executable image that has no awareness of
        memory policy. See the :ref:`Memory Policy APIs
        <memory_policy_apis>` section, below, for an overview of the
        system call that a task may use to set/change its task/process
        policy.

        In a multi-threaded task, task policies apply only to the thread
        [Linux kernel task] that installs the policy and any threads
        subsequently created by that thread. Any sibling threads existing
        at the time a new task policy is installed retain their current
        policy.

        A task policy applies only to pages allocated after the policy is
        installed. Any pages already faulted in by the task when the task
        changes its task policy remain where they were allocated based on
        the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
        A "VMA" or "Virtual Memory Area" refers to a range of a task's
        virtual address space. A task may define a specific policy for
        a range of its virtual address space. See the
        :ref:`Memory Policy APIs <memory_policy_apis>` section,
        below, for an overview of the mbind() system call used to set a
        VMA policy.

        A VMA policy will govern the allocation of pages that back
        this region of the address space. Any regions of the task's
        address space that don't have an explicit VMA policy will fall
        back to the task policy, which may itself fall back to the
        System Default Policy.

        VMA policies have a few complicating details:

        * VMA policy applies ONLY to anonymous pages. These include
          pages allocated for anonymous segments, such as the task
          stack and heap, and any regions of the address space
          mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
          applied to a file mapping, it will be ignored if the mapping
          used the MAP_SHARED flag. If the file mapping used the
          MAP_PRIVATE flag, the VMA policy will only be applied when
          an anonymous page is allocated on an attempt to write to the
          mapping--i.e., at Copy-On-Write.

        * VMA policies are shared between all tasks that share a
          virtual address space--a.k.a. threads--independent of when
          the policy is installed; and they are inherited across
          fork(). However, because VMA policies refer to a specific
          region of a task's address space, and because the address
          space is discarded and recreated on exec*(), VMA policies
          are NOT inheritable across exec(). Thus, only NUMA-aware
          applications may use VMA policies.

        * A task may install a new VMA policy on a sub-range of a
          previously mmap()ed region. When this happens, Linux splits
          the existing virtual memory area into 2 or 3 VMAs, each with
          its own policy.

        * By default, VMA policy applies only to pages allocated after
          the policy is installed. Any pages already faulted into the
          VMA range remain where they were allocated based on the
          policy at the time they were allocated. However, since
          2.6.16, Linux supports page migration via the mbind() system
          call, so that page contents can be moved to match a newly
          installed policy.

Shared Policy
        Conceptually, shared policies apply to "memory objects" mapped
        shared into one or more tasks' distinct address spaces. An
        application installs shared policies the same way as VMA
        policies--using the mbind() system call specifying a range of
        virtual addresses that map the shared object. However, unlike
        VMA policies, which can be considered to be an attribute of a
        range of a task's address space, shared policies apply
        directly to the shared object. Thus, all tasks that attach to
        the object share the policy, and all pages allocated for the
        shared object, by any task, will obey the shared policy.

        As of 2.6.22, only shared memory segments, created by shmget() or
        mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
        policy support was added to Linux, the associated data structures were
        added to hugetlbfs shmem segments. At the time, hugetlbfs did not
        support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
        shmem segments were never "hooked up" to the shared policy support.
        Although hugetlbfs segments now support lazy allocation, their support
        for shared policy has not been completed.

        As mentioned above in the :ref:`VMA policies <vma_policy>` section,
        allocations of page cache pages for regular files mmap()ed
        with MAP_SHARED ignore any VMA policy installed on the virtual
        address range backed by the shared file mapping. Rather,
        shared page cache pages, including pages backing private
        mappings that have not yet been written by the task, follow
        task policy, if any, else System Default Policy.

        The shared policy infrastructure supports different policies on subset
        ranges of the shared object. However, Linux still splits the VMA of
        the task that installs the policy for each range of distinct policy.
        Thus, different tasks that attach to a shared memory segment can have
        different VMA configurations mapping that one shared object.
This 15662306a36Sopenharmony_ci can be seen by examining the /proc/<pid>/numa_maps of tasks sharing 15762306a36Sopenharmony_ci a shared memory region, when one task has installed shared policy on 15862306a36Sopenharmony_ci one or more ranges of the region. 15962306a36Sopenharmony_ci 16062306a36Sopenharmony_ciComponents of Memory Policies 16162306a36Sopenharmony_ci----------------------------- 16262306a36Sopenharmony_ci 16362306a36Sopenharmony_ciA NUMA memory policy consists of a "mode", optional mode flags, and 16462306a36Sopenharmony_cian optional set of nodes. The mode determines the behavior of the 16562306a36Sopenharmony_cipolicy, the optional mode flags determine the behavior of the mode, 16662306a36Sopenharmony_ciand the optional set of nodes can be viewed as the arguments to the 16762306a36Sopenharmony_cipolicy behavior. 16862306a36Sopenharmony_ci 16962306a36Sopenharmony_ciInternally, memory policies are implemented by a reference counted 17062306a36Sopenharmony_cistructure, struct mempolicy. Details of this structure will be 17162306a36Sopenharmony_cidiscussed in context, below, as required to explain the behavior. 17262306a36Sopenharmony_ci 17362306a36Sopenharmony_ciNUMA memory policy supports the following 4 behavioral modes: 17462306a36Sopenharmony_ci 17562306a36Sopenharmony_ciDefault Mode--MPOL_DEFAULT 17662306a36Sopenharmony_ci This mode is only used in the memory policy APIs. Internally, 17762306a36Sopenharmony_ci MPOL_DEFAULT is converted to the NULL memory policy in all 17862306a36Sopenharmony_ci policy scopes. Any existing non-default policy will simply be 17962306a36Sopenharmony_ci removed when MPOL_DEFAULT is specified. As a result, 18062306a36Sopenharmony_ci MPOL_DEFAULT means "fall back to the next most specific policy 18162306a36Sopenharmony_ci scope." 18262306a36Sopenharmony_ci 18362306a36Sopenharmony_ci For example, a NULL or default task policy will fall back to the 18462306a36Sopenharmony_ci system default policy. 
        A NULL or default VMA policy will fall back to the task policy.

        When specified in one of the memory policy APIs, the Default mode
        does not use the optional set of nodes.

        It is an error for the set of nodes specified for this policy to
        be non-empty.

MPOL_BIND
        This mode specifies that memory must come from the set of
        nodes specified by the policy. Memory will be allocated from
        the node in the set with sufficient free memory that is
        closest to the node where the allocation takes place.

MPOL_PREFERRED
        This mode specifies that the allocation should be attempted
        from the single node specified in the policy. If that
        allocation fails, the kernel will search other nodes, in order
        of increasing distance from the preferred node based on
        information provided by the platform firmware.

        Internally, the Preferred policy uses a single node--the
        preferred_node member of struct mempolicy. When the internal
        mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
        and the policy is interpreted as local allocation. "Local"
        allocation policy can be viewed as a Preferred policy that
        starts at the node containing the cpu where the allocation
        takes place.

        It is possible for the user to specify that local allocation
        is always preferred by passing an empty nodemask with this
        mode. If an empty nodemask is passed, the policy cannot use
        the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
        described below.

MPOL_INTERLEAVED
        This mode specifies that page allocations be interleaved, on a
        page granularity, across the nodes specified in the policy.
        This mode also behaves slightly differently, based on the
        context where it is used:

        For allocation of anonymous pages and shared memory pages,
        Interleave mode indexes the set of nodes specified by the
        policy using the page offset of the faulting address into the
        segment [VMA] containing the address modulo the number of
        nodes specified by the policy. It then attempts to allocate a
        page, starting at the selected node, as if the node had been
        specified by a Preferred policy or had been selected by a
        local allocation. That is, allocation will follow the per
        node zonelist.

        For allocation of page cache pages, Interleave mode indexes
        the set of nodes specified by the policy using a node counter
        maintained per task. This counter wraps around to the lowest
        specified node after it reaches the highest specified node.
        This will tend to spread the pages out over the nodes
        specified by the policy based on the order in which they are
        allocated, rather than based on any page offset into an
        address range or file. During system boot up, the temporary
        interleaved system default policy works in this mode.

MPOL_PREFERRED_MANY
        This mode specifies that the allocation should be preferably
        satisfied from the nodemask specified in the policy. If there is
        memory pressure on all nodes in the nodemask, the allocation
        can fall back to all existing NUMA nodes. This is effectively
        MPOL_PREFERRED allowed for a mask of nodes rather than a single
        node.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
        This flag specifies that the nodemask passed by
        the user should not be remapped if the task or VMA's set of allowed
        nodes changes after the memory policy has been defined.

        Without this flag, any time a mempolicy is rebound because of a
        change in the set of allowed nodes, the preferred nodemask (Preferred
        Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
        remapped to the new set of allowed nodes. This may result in nodes
        being used that were previously undesired.

        With this flag, if the user-specified nodes overlap with the
        nodes allowed by the task's cpuset, then the memory policy is
        applied to their intersection. If the two sets of nodes do not
        overlap, the Default policy is used.

        For example, consider a task that is attached to a cpuset with
        mems 1-3 that sets an Interleave policy over the same set. If
        the cpuset's mems change to 3-5, the Interleave will now occur
        over nodes 3, 4, and 5. With this flag, however, since only node
        3 is allowed from the user's nodemask, the "interleave" only
        occurs over that node. If no nodes from the user's nodemask are
        now allowed, the Default behavior is used.

        MPOL_F_STATIC_NODES cannot be combined with the
        MPOL_F_RELATIVE_NODES flag. It also cannot be used for
        MPOL_PREFERRED policies that were created with an empty nodemask
        (local allocation).

MPOL_F_RELATIVE_NODES
        This flag specifies that the nodemask passed
        by the user will be mapped relative to the task's or VMA's set
        of allowed nodes. The kernel stores the user-passed nodemask,
        and if the set of allowed nodes changes, then that original
        nodemask will be remapped relative to the new set of allowed
        nodes.

        Without this flag (and without MPOL_F_STATIC_NODES), any time a
        mempolicy is rebound because of a change in the set of allowed
        nodes, the node (Preferred) or nodemask (Bind, Interleave) is
        remapped to the new set of allowed nodes. That remap may not
        preserve the relative nature of the user's passed nodemask to its
        set of allowed nodes upon successive rebinds: a nodemask of
        1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
        allowed nodes is restored to its original state.

        With this flag, the remap is done so that the node numbers from
        the user's passed nodemask are relative to the set of allowed
        nodes. In other words, if nodes 0, 2, and 4 are set in the user's
        nodemask, the policy will be effected over the first (and in the
        Bind or Interleave case, the third and fifth) nodes in the set of
        allowed nodes. The nodemask passed by the user represents nodes
        relative to the task's or VMA's set of allowed nodes.

        If the user's nodemask includes nodes that are outside the range
        of the new set of allowed nodes (for example, node 5 is set in
        the user's nodemask when the set of allowed nodes is only 0-3),
        then the remap wraps around to the beginning of the nodemask and,
        if not already set, sets the node in the mempolicy nodemask.

        For example, consider a task that is attached to a cpuset with
        mems 2-5 that sets an Interleave policy over the same set with
        MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
        interleave now occurs over nodes 3,5-7. If the cpuset's mems
        then change to 0,2-3,5, then the interleave occurs over nodes
        0,2-3,5.

        Thanks to the consistent remapping, applications preparing
        nodemasks to specify memory policies using this flag should
        disregard their current, actual cpuset imposed memory placement
        and prepare the nodemask as if they were always located on
        memory nodes 0 to N-1, where N is the number of memory nodes the
        policy is intended to manage. Let the kernel then remap to the
        set of memory nodes allowed by the task's cpuset, as that may
        change over time.

        MPOL_F_RELATIVE_NODES cannot be combined with the
        MPOL_F_STATIC_NODES flag. It also cannot be used for
        MPOL_PREFERRED policies that were created with an empty nodemask
        (local allocation).

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively.
mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read. Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.
   To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies. For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path. This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes. This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down. However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 4 system calls for controlling memory policy.
These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel. The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

        long set_mempolicy(int mode, const unsigned long *nmask,
                           unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'. 'nmask' points to a bit mask of node ids containing at least
'maxnode' ids. Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.


Get [Task] Memory Policy or Related Information::

        long get_mempolicy(int *mode,
                           const unsigned long *nmask, unsigned long maxnode,
                           void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.


Install VMA/Shared Policy for a Range of Task's Address Space::

        long mbind(void *start, unsigned long len, int mode,
                   const unsigned long *nmask, unsigned long maxnode,
                   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnodes) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments. Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

Set home node for a Range of Task's Address Space::

	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
					 unsigned long home_node,
					 unsigned long flags);

sys_set_mempolicy_home_node sets the home node for a VMA policy present in
the task's address range.  The system call updates the home node only for
the existing mempolicy range; other address ranges are ignored.  The home
node is the NUMA node from which page allocations will preferentially be
satisfied.  Specifying a home node overrides the default behavior of
allocating memory close to the CPU on which the task is executing.


Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks installs a shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information, and requires that one know in what cpusets other tasks
might be attaching to the shared region.
Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.