.. _numa_memory_policy:

==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
===========================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004.  This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``),
an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes.  Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
        this policy is "hard coded" into the kernel.  It is the policy
        that governs all page allocations that aren't controlled by
        one of the more specific policy scopes discussed below.  When
        the system is "up and running", the system default policy will
        use "local allocation" described below.  However, during boot
        up, the system default policy will be set to interleave
        allocations across all nodes with "sufficient" memory, so as
        not to overload the initial boot node with boot-time
        allocations.

Task/Process Policy
        this is an optional, per-task policy.  When defined for a
        specific task, this policy controls all page allocations made
        by or on behalf of the task that aren't controlled by a more
        specific scope.  If a task does not define a task policy, then
        all page allocations that would have been controlled by the
        task policy "fall back" to the System Default Policy.

        The task policy applies to the entire address space of a task.  Thus,
        it is inheritable, and indeed is inherited, across both fork()
        [clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
        to establish the task policy for a child task exec()'d from an
        executable image that has no awareness of memory policy.  See the
        :ref:`Memory Policy APIs <memory_policy_apis>` section below for an
        overview of the system call that a task may use to set/change its
        task/process policy.

        In a multi-threaded task, task policies apply only to the thread
        [Linux kernel task] that installs the policy and any threads
        subsequently created by that thread.  Any sibling threads existing
        at the time a new task policy is installed retain their current
        policy.

        A task policy applies only to pages allocated after the policy is
        installed.  Any pages already faulted in by the task when the task
        changes its task policy remain where they were allocated based on
        the policy at the time they were allocated.

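        As an illustration of this inheritance, consider the following
        minimal sketch.  It assumes the set_mempolicy(2) wrapper from the
        libnuma development package (<numaif.h>); the program path and
        node numbers are hypothetical::

            #define _GNU_SOURCE
            #include <numaif.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(void)
            {
                    /* interleave this task's allocations across nodes 0 and 1 */
                    unsigned long nodemask = (1UL << 0) | (1UL << 1);

                    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                                      sizeof(nodemask) * 8) != 0) {
                            perror("set_mempolicy");
                            return 1;
                    }
                    /* the exec'd program, which needs no NUMA awareness of
                       its own, inherits the interleave task policy */
                    execl("/usr/bin/some_app", "some_app", (char *)NULL);
                    perror("execl");
                    return 1;
            }
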
.. _vma_policy:

VMA Policy
        A "VMA" or "Virtual Memory Area" refers to a range of a task's
        virtual address space.  A task may define a specific policy for a range
        of its virtual address space.  See the
        :ref:`Memory Policy APIs <memory_policy_apis>` section below for an
        overview of the mbind() system call used to set a VMA policy.

        A VMA policy will govern the allocation of pages that back
        this region of the address space.  Any regions of the task's
        address space that don't have an explicit VMA policy will fall
        back to the task policy, which may itself fall back to the
        System Default Policy.

        VMA policies have a few complicating details:

        * VMA policy applies ONLY to anonymous pages.  These include
          pages allocated for anonymous segments, such as the task
          stack and heap, and any regions of the address space
          mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
          applied to a file mapping, it will be ignored if the mapping
          used the MAP_SHARED flag.  If the file mapping used the
          MAP_PRIVATE flag, the VMA policy will only be applied when
          an anonymous page is allocated on an attempt to write to the
          mapping--i.e., at Copy-On-Write.

        * VMA policies are shared between all tasks that share a
          virtual address space--a.k.a. threads--independent of when
          the policy is installed; and they are inherited across
          fork().  However, because VMA policies refer to a specific
          region of a task's address space, and because the address
          space is discarded and recreated on exec*(), VMA policies
          are NOT inheritable across exec().  Thus, only NUMA-aware
          applications may use VMA policies.

        * A task may install a new VMA policy on a sub-range of a
          previously mmap()ed region.  When this happens, Linux splits
          the existing virtual memory area into 2 or 3 VMAs, each with
          its own policy.

        * By default, VMA policy applies only to pages allocated after
          the policy is installed.  Any pages already faulted into the
          VMA range remain where they were allocated based on the
          policy at the time they were allocated.  However, since
          2.6.16, Linux supports page migration via the mbind() system
          call, so that page contents can be moved to match a newly
          installed policy.

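        For a concrete picture of a VMA policy, here is a minimal sketch
        that binds an anonymous mapping to nodes 0-1.  It assumes
        <numaif.h> from the libnuma development package; the region size
        and node numbers are illustrative::

            #define _GNU_SOURCE
            #include <numaif.h>
            #include <stdio.h>
            #include <sys/mman.h>

            int main(void)
            {
                    size_t len = 1UL << 20;         /* 1 MiB anonymous region */
                    unsigned long nodemask = (1UL << 0) | (1UL << 1);
                    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                    if (p == MAP_FAILED) {
                            perror("mmap");
                            return 1;
                    }
                    /* pages faulted into [p, p+len) must come from nodes 0-1 */
                    if (mbind(p, len, MPOL_BIND, &nodemask,
                              sizeof(nodemask) * 8, 0) != 0)
                            perror("mbind");
                    return 0;
            }
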
Shared Policy
        Conceptually, shared policies apply to "memory objects" mapped
        shared into one or more tasks' distinct address spaces.  An
        application installs shared policies the same way as VMA
        policies--using the mbind() system call specifying a range of
        virtual addresses that map the shared object.  However, unlike
        VMA policies, which can be considered to be an attribute of a
        range of a task's address space, shared policies apply
        directly to the shared object.  Thus, all tasks that attach to
        the object share the policy, and all pages allocated for the
        shared object, by any task, will obey the shared policy.

        As of 2.6.22, only shared memory segments, created by shmget() or
        mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
        policy support was added to Linux, the associated data structures were
        added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
        support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
        shmem segments were never "hooked up" to the shared policy support.
        Although hugetlbfs segments now support lazy allocation, their support
        for shared policy has not been completed.

        As mentioned in the :ref:`VMA policies <vma_policy>` section above,
        allocations of page cache pages for regular files mmap()ed
        with MAP_SHARED ignore any VMA policy installed on the virtual
        address range backed by the shared file mapping.  Rather,
        shared page cache pages, including pages backing private
        mappings that have not yet been written by the task, follow
        task policy, if any, else System Default Policy.

        The shared policy infrastructure supports different policies on subset
        ranges of the shared object.  However, Linux still splits the VMA of
        the task that installs the policy for each range of distinct policy.
        Thus, different tasks that attach to a shared memory segment can have
        different VMA configurations mapping that one shared object.  This
        can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
        a shared memory region, when one task has installed shared policy on
        one or more ranges of the region.

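        The following minimal sketch installs a shared policy through one
        task's mapping of a MAP_SHARED|MAP_ANONYMOUS segment.  It assumes
        <numaif.h> from the libnuma development package; node numbers and
        the segment size are illustrative.  Because the policy attaches
        to the object itself, a child created after the mbind() call (or
        any other task attaching to the same segment) allocates its pages
        under the same interleave policy::

            #define _GNU_SOURCE
            #include <numaif.h>
            #include <stdio.h>
            #include <sys/mman.h>

            int main(void)
            {
                    size_t len = 1UL << 21;
                    unsigned long nodemask = (1UL << 0) | (1UL << 1);
                    void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

                    if (seg == MAP_FAILED) {
                            perror("mmap");
                            return 1;
                    }
                    /* installs a shared policy on the object backing the
                       range, not merely on this task's VMA */
                    if (mbind(seg, len, MPOL_INTERLEAVE, &nodemask,
                              sizeof(nodemask) * 8, 0) != 0)
                            perror("mbind");
                    return 0;
            }
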
Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference-counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
        This mode is only used in the memory policy APIs.  Internally,
        MPOL_DEFAULT is converted to the NULL memory policy in all
        policy scopes.  Any existing non-default policy will simply be
        removed when MPOL_DEFAULT is specified.  As a result,
        MPOL_DEFAULT means "fall back to the next most specific policy
        scope."

        For example, a NULL or default task policy will fall back to the
        system default policy.  A NULL or default VMA policy will fall
        back to the task policy.

        When specified in one of the memory policy APIs, the Default mode
        does not use the optional set of nodes.

        It is an error for the set of nodes specified for this policy to
        be non-empty.

MPOL_BIND
        This mode specifies that memory must come from the set of
        nodes specified by the policy.  Memory will be allocated from
        the node in the set with sufficient free memory that is
        closest to the node where the allocation takes place.

MPOL_PREFERRED
        This mode specifies that the allocation should be attempted
        from the single node specified in the policy.  If that
        allocation fails, the kernel will search other nodes, in order
        of increasing distance from the preferred node, based on
        information provided by the platform firmware.

        Internally, the Preferred policy uses a single node--the
        preferred_node member of struct mempolicy.  When the internal
        mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
        and the policy is interpreted as local allocation.  "Local"
        allocation policy can be viewed as a Preferred policy that
        starts at the node containing the cpu where the allocation
        takes place.

        It is possible for the user to specify that local allocation
        is always preferred by passing an empty nodemask with this
        mode.  If an empty nodemask is passed, the policy cannot use
        the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
        described below.

MPOL_INTERLEAVE
        This mode specifies that page allocations be interleaved, at
        page granularity, across the nodes specified in the policy.
        This mode also behaves slightly differently, based on the
        context where it is used:

        For allocation of anonymous pages and shared memory pages,
        Interleave mode indexes the set of nodes specified by the
        policy using the page offset of the faulting address into the
        segment [VMA] containing the address, modulo the number of
        nodes specified by the policy.  It then attempts to allocate a
        page, starting at the selected node, as if the node had been
        specified by a Preferred policy or had been selected by a
        local allocation.  That is, allocation will follow the
        per-node zonelist.

        For allocation of page cache pages, Interleave mode indexes
        the set of nodes specified by the policy using a node counter
        maintained per task.  This counter wraps around to the lowest
        specified node after it reaches the highest specified node.
        This will tend to spread the pages out over the nodes
        specified by the policy based on the order in which they are
        allocated, rather than based on any page offset into an
        address range or file.  During system boot up, the temporary
        interleaved system default policy works in this mode.

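        The node selection for anonymous pages reduces to simple modular
        arithmetic.  The sketch below is illustrative user-space code,
        not the kernel implementation, and assumes 4 KiB pages::

            #include <stdio.h>

            #define PAGE_SHIFT 12   /* assume 4 KiB pages */

            /* pick the interleave node for a fault at 'addr' within a VMA
               starting at 'vma_start', given the policy's node array */
            static int interleave_node(unsigned long vma_start,
                                       unsigned long addr,
                                       const int *nodes, int nnodes)
            {
                    unsigned long page_off = (addr - vma_start) >> PAGE_SHIFT;

                    return nodes[page_off % nnodes];
            }

            int main(void)
            {
                    int nodes[] = { 1, 3, 5 };   /* policy nodemask {1,3,5} */
                    unsigned long vma = 0x7f0000000000UL;

                    for (int i = 0; i < 6; i++)
                            printf("page %d -> node %d\n", i,
                                   interleave_node(vma,
                                           vma + ((unsigned long)i << PAGE_SHIFT),
                                           nodes, 3));
                    return 0;
            }
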
NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
        This flag specifies that the nodemask passed by
        the user should not be remapped if the task or VMA's set of allowed
        nodes changes after the memory policy has been defined.

        Without this flag, any time a mempolicy is rebound because of a
        change in the set of allowed nodes, the node (Preferred) or
        nodemask (Bind, Interleave) is remapped to the new set of
        allowed nodes.  This may result in nodes being used that were
        previously undesired.

        With this flag, if the user-specified nodes overlap with the
        nodes allowed by the task's cpuset, then the memory policy is
        applied to their intersection.  If the two sets of nodes do not
        overlap, the Default policy is used.

        For example, consider a task that is attached to a cpuset with
        mems 1-3 that sets an Interleave policy over the same set.  If
        the cpuset's mems change to 3-5, the Interleave will now occur
        over nodes 3, 4, and 5.  With this flag, however, since only node
        3 is allowed from the user's nodemask, the "interleave" only
        occurs over that node.  If no nodes from the user's nodemask are
        now allowed, the Default behavior is used.

        MPOL_F_STATIC_NODES cannot be combined with the
        MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
        MPOL_PREFERRED policies that were created with an empty nodemask
        (local allocation).

MPOL_F_RELATIVE_NODES
        This flag specifies that the nodemask passed
        by the user will be mapped relative to the task's or VMA's set of
        allowed nodes.  The kernel stores the user-passed nodemask, and if
        the set of allowed nodes changes, then that original nodemask will
        be remapped relative to the new set of allowed nodes.

        Without this flag (and without MPOL_F_STATIC_NODES), any time a
        mempolicy is rebound because of a change in the set of allowed
        nodes, the node (Preferred) or nodemask (Bind, Interleave) is
        remapped to the new set of allowed nodes.  That remap may not
        preserve the relative nature of the user's passed nodemask to its
        set of allowed nodes upon successive rebinds: a nodemask of
        1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
        allowed nodes is restored to its original state.

        With this flag, the remap is done so that the node numbers from
        the user's passed nodemask are relative to the set of allowed
        nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
        nodemask, the policy will be applied over the first (and in the
        Bind or Interleave case, the third and fifth) nodes in the set of
        allowed nodes.  The nodemask passed by the user represents nodes
        relative to the task's or VMA's set of allowed nodes.

        If the user's nodemask includes nodes that are outside the range
        of the new set of allowed nodes (for example, node 5 is set in
        the user's nodemask when the set of allowed nodes is only 0-3),
        then the remap wraps around to the beginning of the nodemask and,
        if not already set, sets the node in the mempolicy nodemask.

        For example, consider a task that is attached to a cpuset with
        mems 2-5 that sets an Interleave policy over the same set with
        MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
        interleave now occurs over nodes 3,5-7.  If the cpuset's mems
        then change to 0,2-3,5, then the interleave occurs over nodes
        0,2-3,5.

        Thanks to the consistent remapping, applications preparing
        nodemasks to specify memory policies using this flag should
        disregard their current, actual cpuset-imposed memory placement
        and prepare the nodemask as if they were always located on
        memory nodes 0 to N-1, where N is the number of memory nodes the
        policy is intended to manage.  Let the kernel then remap to the
        set of memory nodes allowed by the task's cpuset, as that may
        change over time; a sketch of this convention appears at the end
        of this item.

        MPOL_F_RELATIVE_NODES cannot be combined with the
        MPOL_F_STATIC_NODES flag.  It also cannot be used for
        MPOL_PREFERRED policies that were created with an empty nodemask
        (local allocation).

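        Following the advice above, this minimal sketch prepares a
        0-based nodemask and lets the kernel map it onto whatever nodes
        the cpuset currently allows.  It assumes a <numaif.h> (libnuma
        development package) recent enough to define
        MPOL_F_RELATIVE_NODES::

            #define _GNU_SOURCE
            #include <numaif.h>
            #include <stdio.h>

            int main(void)
            {
                    /* interleave over the first three allowed nodes,
                       whichever nodes those happen to be right now */
                    unsigned long nodemask = (1UL << 0) | (1UL << 1) |
                                             (1UL << 2);

                    if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
                                      &nodemask, sizeof(nodemask) * 8) != 0)
                            perror("set_mempolicy");
                    return 0;
            }
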
Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively.  mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy.  When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes.  "Usage" here means one of the following:

1) Querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) Examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation.  This is considered a "hot
   path".  Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) We never need to get/free the system default policy, as it is never
   changed nor freed once the system is up and running.

2) For querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query.  The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies.  Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path, where
   we hold the mmap_lock for read.  Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration.  One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.  To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure.  This requires that we drop this extra
   reference when we're finished "using" the policy.  We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.  For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path.  This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes.  This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down.  However, this might not be appropriate for all applications.

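To make the "conditional drop" pattern concrete, here is a schematic
user-space sketch.  It is NOT the kernel's code: all names here are
hypothetical, and the real struct mempolicy uses an atomic reference
count and a kmem cache::

        #include <stdlib.h>

        struct toy_mempolicy {
                int refcnt;     /* atomic in the real implementation */
                int shared;     /* set when found via shared-policy lookup */
        };

        static void toy_put(struct toy_mempolicy *pol)
        {
                if (--pol->refcnt == 0)
                        free(pol);
        }

        /* common query/allocation exit path: the extra reference taken
           during shared-policy lookup is dropped only for shared policies */
        static void toy_cond_put(struct toy_mempolicy *pol)
        {
                if (pol && pol->shared)
                        toy_put(pol);
        }

        int main(void)
        {
                struct toy_mempolicy *pol = calloc(1, sizeof(*pol));

                if (!pol)
                        return 1;
                pol->refcnt = 2;        /* installer's ref + lookup's extra ref */
                pol->shared = 1;
                toy_cond_put(pol);      /* drops the lookup reference */
                toy_put(pol);           /* drops the last reference, frees */
                return 0;
        }
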
.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

        long set_mempolicy(int mode, const unsigned long *nmask,
                           unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

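For node ids beyond the width of a single long, 'nmask' is treated as an
array of unsigned longs and 'maxnode' bounds how many node-id bits the
kernel reads from it.  A minimal sketch of that convention, assuming
<numaif.h> from the libnuma development package (node numbers are
illustrative)::

        #define _GNU_SOURCE
        #include <numaif.h>
        #include <stdio.h>
        #include <string.h>

        #define MAX_NODES 128
        #define BITS_PER_LONG (8 * sizeof(unsigned long))

        static void set_node(unsigned long *mask, int node)
        {
                mask[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
        }

        int main(void)
        {
                unsigned long nmask[MAX_NODES / BITS_PER_LONG];

                memset(nmask, 0, sizeof(nmask));
                set_node(nmask, 1);     /* node 1: first word of the mask */
                set_node(nmask, 65);    /* node 65: second word on 64-bit */

                if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
                                  nmask, MAX_NODES) != 0)
                        perror("set_mempolicy");
                return 0;
        }
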
See the set_mempolicy(2) man page for more details.


Get [Task] Memory Policy or Related Information::

        long get_mempolicy(int *mode,
                           const unsigned long *nmask, unsigned long maxnode,
                           void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.


Install VMA/Shared Policy for a Range of Task's Address Space::

        long mbind(void *start, unsigned long len, int mode,
                   const unsigned long *nmask, unsigned long maxnode,
                   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

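Putting two of these calls together, the following minimal sketch binds
an anonymous page to node 0 with mbind(), then uses get_mempolicy() with
MPOL_F_NODE|MPOL_F_ADDR to report the node on which the page was
actually allocated.  It assumes <numaif.h> from the libnuma development
package and a system where node 0 has memory::

        #define _GNU_SOURCE
        #include <numaif.h>
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 4096;
                unsigned long nodemask = 1UL << 0;      /* node 0 only */
                int node = -1;
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                if (mbind(p, len, MPOL_BIND, &nodemask,
                          sizeof(nodemask) * 8, 0) != 0)
                        perror("mbind");
                p[0] = 1;               /* fault the page in */
                if (get_mempolicy(&node, NULL, 0, p,
                                  MPOL_F_NODE | MPOL_F_ADDR) == 0)
                        printf("page resides on node %d\n", node);
                else
                        perror("get_mempolicy");
                return 0;
        }
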
Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks installs a shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policy.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information, and requires that one know in what cpusets other tasks
might be attaching to the shared region.  Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.