.. _numa_memory_policy:

==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004.  This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``)
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
	this policy is "hard coded" into the kernel.  It is the policy
	that governs all page allocations that aren't controlled by
	one of the more specific policy scopes discussed below.  When
	the system is "up and running", the system default policy will
	use "local allocation" described below.  However, during boot
	up, the system default policy will be set to interleave
	allocations across all nodes with "sufficient" memory, so as
	not to overload the initial boot node with boot-time
	allocations.

Task/Process Policy
	this is an optional, per-task policy.  When defined for a
	specific task, this policy controls all page allocations made
	by or on behalf of the task that aren't controlled by a more
	specific scope. If a task does not define a task policy, then
	all page allocations that would have been controlled by the
	task policy "fall back" to the System Default Policy.

	The task policy applies to the entire address space of a task. Thus,
	it is inheritable, and indeed is inherited, across both fork()
	[clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
	to establish the task policy for a child task exec()'d from an
	executable image that has no awareness of memory policy.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the system call
	that a task may use to set/change its task/process policy.

	In a multi-threaded task, task policies apply only to the thread
	[Linux kernel task] that installs the policy and any threads
	subsequently created by that thread.  Any sibling threads existing
	at the time a new task policy is installed retain their current
	policy.

	A task policy applies only to pages allocated after the policy is
	installed.  Any pages already faulted in by the task when the task
	changes its task policy remain where they were allocated based on
	the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
	A "VMA" or "Virtual Memory Area" refers to a range of a task's
	virtual address space.  A task may define a specific policy for a range
	of its virtual address space.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the mbind() system call used to set a VMA
	policy.

	A VMA policy will govern the allocation of pages that back
	this region of the address space.  Any regions of the task's
	address space that don't have an explicit VMA policy will fall
	back to the task policy, which may itself fall back to the
	System Default Policy.

	VMA policies have a few complicating details:

	* VMA policy applies ONLY to anonymous pages.  These include
	  pages allocated for anonymous segments, such as the task
	  stack and heap, and any regions of the address space
	  mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
	  applied to a file mapping, it will be ignored if the mapping
	  used the MAP_SHARED flag.  If the file mapping used the
	  MAP_PRIVATE flag, the VMA policy will only be applied when
	  an anonymous page is allocated on an attempt to write to the
	  mapping-- i.e., at Copy-On-Write.

	* VMA policies are shared between all tasks that share a
	  virtual address space--a.k.a. threads--independent of when
	  the policy is installed; and they are inherited across
	  fork().  However, because VMA policies refer to a specific
	  region of a task's address space, and because the address
	  space is discarded and recreated on exec*(), VMA policies
	  are NOT inheritable across exec().  Thus, only NUMA-aware
	  applications may use VMA policies.

	* A task may install a new VMA policy on a sub-range of a
	  previously mmap()ed region.  When this happens, Linux splits
	  the existing virtual memory area into 2 or 3 VMAs, each with
	  its own policy.

	* By default, VMA policy applies only to pages allocated after
	  the policy is installed.  Any pages already faulted into the
	  VMA range remain where they were allocated based on the
	  policy at the time they were allocated.  However, since
	  2.6.16, Linux supports page migration via the mbind() system
	  call, so that page contents can be moved to match a newly
	  installed policy.

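	The migration behavior above can be sketched in user-space C.
	This is an illustrative sketch, not a definitive recipe: the
	MPOL_* constants are copied from the stable user ABI in
	<linux/mempolicy.h> (glibc does not wrap mbind(), so libnuma or
	syscall(2) is normally used), and the target node 0 and helper
	names are assumptions for the example::

		#include <sys/syscall.h>
		#include <unistd.h>

		/* Stable user-space ABI values, normally obtained from
		 * <numaif.h> (libnuma) or <linux/mempolicy.h>. */
		#define MPOL_BIND     2
		#define MPOL_MF_MOVE  (1 << 1)

		/* Single-word nodemask with only `node` set. */
		unsigned long nodemask_of(int node)
		{
			return 1UL << node;
		}

		/* Bind [start, start+len) to `node` and migrate pages the
		 * task has already faulted in.  On failure, -1 is returned
		 * and errno is set (e.g. for an offline node). */
		long bind_and_migrate(void *start, unsigned long len, int node)
		{
			unsigned long mask = nodemask_of(node);

			/* maxnode must cover the highest node id in the mask. */
			return syscall(SYS_mbind, start, len,
				       (unsigned long)MPOL_BIND, &mask,
				       (unsigned long)(node + 2),
				       (unsigned long)MPOL_MF_MOVE);
		}

	A typical caller would mmap() and touch a region first, then call
	bind_and_migrate(addr, len, 0) to pull its pages onto node 0.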
Shared Policy
	Conceptually, shared policies apply to "memory objects" mapped
	shared into one or more tasks' distinct address spaces.  An
	application installs shared policies the same way as VMA
	policies--using the mbind() system call specifying a range of
	virtual addresses that map the shared object.  However, unlike
	VMA policies, which can be considered to be an attribute of a
	range of a task's address space, shared policies apply
	directly to the shared object.  Thus, all tasks that attach to
	the object share the policy, and all pages allocated for the
	shared object, by any task, will obey the shared policy.

	As of 2.6.22, only shared memory segments, created by shmget() or
	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
	policy support was added to Linux, the associated data structures were
	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
	support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
	shmem segments were never "hooked up" to the shared policy support.
	Although hugetlbfs segments now support lazy allocation, their support
	for shared policy has not been completed.

	As mentioned above in the :ref:`VMA policies <vma_policy>` section,
	allocations of page cache pages for regular files mmap()ed
	with MAP_SHARED ignore any VMA policy installed on the virtual
	address range backed by the shared file mapping.  Rather,
	shared page cache pages, including pages backing private
	mappings that have not yet been written by the task, follow
	task policy, if any, else System Default Policy.

	The shared policy infrastructure supports different policies on subset
	ranges of the shared object.  However, Linux still splits the VMA of
	the task that installs the policy for each range of distinct policy.
	Thus, different tasks that attach to a shared memory segment can have
	different VMA configurations mapping that one shared object.  This
	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
	a shared memory region, when one task has installed shared policy on
	one or more ranges of the region.

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
	This mode is only used in the memory policy APIs.  Internally,
	MPOL_DEFAULT is converted to the NULL memory policy in all
	policy scopes.  Any existing non-default policy will simply be
	removed when MPOL_DEFAULT is specified.  As a result,
	MPOL_DEFAULT means "fall back to the next most specific policy
	scope."

	For example, a NULL or default task policy will fall back to the
	system default policy.  A NULL or default vma policy will fall
	back to the task policy.

	When specified in one of the memory policy APIs, the Default mode
	does not use the optional set of nodes.

	It is an error for the set of nodes specified for this policy to
	be non-empty.

MPOL_BIND
	This mode specifies that memory must come from the set of
	nodes specified by the policy.  Memory will be allocated from
	the node in the set with sufficient free memory that is
	closest to the node where the allocation takes place.

MPOL_PREFERRED
	This mode specifies that the allocation should be attempted
	from the single node specified in the policy.  If that
	allocation fails, the kernel will search other nodes, in order
	of increasing distance from the preferred node based on
	information provided by the platform firmware.

	Internally, the Preferred policy uses a single node--the
	preferred_node member of struct mempolicy.  When the internal
	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
	and the policy is interpreted as local allocation.  "Local"
	allocation policy can be viewed as a Preferred policy that
	starts at the node containing the cpu where the allocation
	takes place.

	It is possible for the user to specify that local allocation
	is always preferred by passing an empty nodemask with this
	mode.  If an empty nodemask is passed, the policy cannot use
	the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
	described below.

MPOL_INTERLEAVED
	This mode specifies that page allocations be interleaved, on a
	page granularity, across the nodes specified in the policy.
	This mode also behaves slightly differently, based on the
	context where it is used:

	For allocation of anonymous pages and shared memory pages,
	Interleave mode indexes the set of nodes specified by the
	policy using the page offset of the faulting address into the
	segment [VMA] containing the address modulo the number of
	nodes specified by the policy.  It then attempts to allocate a
	page, starting at the selected node, as if the node had been
	specified by a Preferred policy or had been selected by a
	local allocation.  That is, allocation will follow the per
	node zonelist.

	For allocation of page cache pages, Interleave mode indexes
	the set of nodes specified by the policy using a node counter
	maintained per task.  This counter wraps around to the lowest
	specified node after it reaches the highest specified node.
	This will tend to spread the pages out over the nodes
	specified by the policy based on the order in which they are
	allocated, rather than based on any page offset into an
	address range or file.  During system boot up, the temporary
	interleaved system default policy works in this mode.

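The anonymous-page interleave calculation described above can be sketched
as pure arithmetic.  This models the documented behavior (page offset
modulo the number of policy nodes selects a position in the node set);
the helper names are invented for the example and do not correspond to
kernel symbols::

	/* Return the node id at position `idx`, counting only the set
	 * bits of a single-word nodemask. */
	static int nth_node(unsigned long nodemask, int idx)
	{
		for (int node = 0; node < 64; node++)
			if ((nodemask & (1UL << node)) && idx-- == 0)
				return node;
		return -1;
	}

	static int node_count(unsigned long mask)
	{
		int n = 0;
		while (mask) { n += mask & 1; mask >>= 1; }
		return n;
	}

	/* Select the interleave node for the page at `page_offset`
	 * within the VMA, per the behavior described above. */
	int interleave_node(unsigned long nodemask, unsigned long page_offset)
	{
		return nth_node(nodemask, page_offset % node_count(nodemask));
	}

For a policy over nodes 1-3, pages at offsets 0, 1, 2, 3, ... land on
nodes 1, 2, 3, 1, ... respectively.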
NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
	This flag specifies that the nodemask passed by
	the user should not be remapped if the task or VMA's set of allowed
	nodes changes after the memory policy has been defined.

	Without this flag, any time a mempolicy is rebound because of a
	change in the set of allowed nodes, the node (Preferred) or
	nodemask (Bind, Interleave) is remapped to the new set of
	allowed nodes.  This may result in nodes being used that were
	previously undesired.

	With this flag, if the user-specified nodes overlap with the
	nodes allowed by the task's cpuset, then the memory policy is
	applied to their intersection.  If the two sets of nodes do not
	overlap, the Default policy is used.

	For example, consider a task that is attached to a cpuset with
	mems 1-3 that sets an Interleave policy over the same set.  If
	the cpuset's mems change to 3-5, the Interleave will now occur
	over nodes 3, 4, and 5.  With this flag, however, since only node
	3 is allowed from the user's nodemask, the "interleave" only
	occurs over that node.  If no nodes from the user's nodemask are
	now allowed, the Default behavior is used.

	MPOL_F_STATIC_NODES cannot be combined with the
	MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

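The MPOL_F_STATIC_NODES rebind rule above reduces to a set intersection,
which can be sketched as follows.  This models only the documented
semantics; an empty result (0 here) stands for the fall back to the
Default policy::

	/* Effective nodemask after a cpuset rebind with
	 * MPOL_F_STATIC_NODES: the user's original mask is never
	 * remapped, only intersected with the allowed nodes. */
	unsigned long static_nodes_rebind(unsigned long user_mask,
					  unsigned long allowed_mask)
	{
		return user_mask & allowed_mask;  /* empty => Default */
	}

Using the example above, a user mask of nodes 1-3 intersected with new
cpuset mems 3-5 leaves only node 3.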
MPOL_F_RELATIVE_NODES
	This flag specifies that the nodemask passed
	by the user will be mapped relative to the task or VMA's
	set of allowed nodes.  The kernel stores the user-passed nodemask,
	and if the set of allowed nodes changes, then that original nodemask
	will be remapped relative to the new set of allowed nodes.

	Without this flag (and without MPOL_F_STATIC_NODES), anytime a
	mempolicy is rebound because of a change in the set of allowed
	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes.  That remap may not
	preserve the relative nature of the user's passed nodemask to its
	set of allowed nodes upon successive rebinds: a nodemask of
	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
	allowed nodes is restored to its original state.

	With this flag, the remap is done so that the node numbers from
	the user's passed nodemask are relative to the set of allowed
	nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
	nodemask, the policy will be effected over the first (and in the
	Bind or Interleave case, the third and fifth) nodes in the set of
	allowed nodes.  The nodemask passed by the user represents nodes
	relative to the task or VMA's set of allowed nodes.

	If the user's nodemask includes nodes that are outside the range
	of the new set of allowed nodes (for example, node 5 is set in
	the user's nodemask when the set of allowed nodes is only 0-3),
	then the remap wraps around to the beginning of the nodemask and,
	if not already set, sets the node in the mempolicy nodemask.

	For example, consider a task that is attached to a cpuset with
	mems 2-5 that sets an Interleave policy over the same set with
	MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
	interleave now occurs over nodes 3,5-7.  If the cpuset's mems
	then change to 0,2-3,5, then the interleave occurs over nodes
	0,2-3,5.

	Thanks to the consistent remapping, applications preparing
	nodemasks to specify memory policies using this flag should
	disregard their current, actual cpuset imposed memory placement
	and prepare the nodemask as if they were always located on
	memory nodes 0 to N-1, where N is the number of memory nodes the
	policy is intended to manage.  Let the kernel then remap to the
	set of memory nodes allowed by the task's cpuset, as that may
	change over time.

	MPOL_F_RELATIVE_NODES cannot be combined with the
	MPOL_F_STATIC_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

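The relative remap described above can be sketched as follows: bit n of
the user's nodemask selects the (n mod W)-th allowed node, where W is the
number of allowed nodes.  This is a user-space model of the documented
behavior, not the kernel implementation, and the helper names are
invented for the example::

	static int weight(unsigned long mask)
	{
		int n = 0;
		while (mask) { n += mask & 1; mask >>= 1; }
		return n;
	}

	/* Node id of the `idx`-th set bit in `allowed`. */
	static int nth_allowed(unsigned long allowed, int idx)
	{
		for (int node = 0; node < 64; node++)
			if ((allowed & (1UL << node)) && idx-- == 0)
				return node;
		return -1;
	}

	/* MPOL_F_RELATIVE_NODES remap of `user_mask` onto `allowed`,
	 * wrapping positions past the end of the allowed set. */
	unsigned long relative_remap(unsigned long user_mask,
				     unsigned long allowed)
	{
		unsigned long result = 0;
		int w = weight(allowed);

		for (int n = 0; n < 64; n++)
			if (user_mask & (1UL << n))
				result |= 1UL << nth_allowed(allowed, n % w);
		return result;
	}

This reproduces the example above: a user mask of 2-5 remapped onto
allowed mems 3-7 yields 3,5-7, and onto allowed mems 0,2-3,5 yields
0,2-3,5.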
Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put() increment and
decrement this reference count, respectively.  mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy.  When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes.  "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation.  This is considered a "hot
   path".  Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query.  The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies.  Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read.  Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration.  One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.  To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure.  This requires that we drop this extra
   reference when we're finished "using" the policy.  We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.  For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must lookup
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path.  This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes.  This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down.  However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

	long set_mempolicy(int mode, const unsigned long *nmask,
					unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.

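A hedged sketch of the call, combining a mode with an optional flag as
described above.  The MPOL_* values are from the stable user ABI in
<linux/mempolicy.h>; glibc does not wrap set_mempolicy(), so libnuma or
syscall(2) is normally used, and the helper names and node choices here
are illustrative assumptions::

	#include <sys/syscall.h>
	#include <unistd.h>

	#define MPOL_INTERLEAVE      3
	#define MPOL_F_STATIC_NODES  (1 << 15)

	/* Build a single-word nodemask from an array of node ids. */
	unsigned long build_nodemask(const int *nodes, int count)
	{
		unsigned long mask = 0;

		for (int i = 0; i < count; i++)
			mask |= 1UL << nodes[i];
		return mask;
	}

	/* Install an interleave task policy over the given nodes,
	 * with the flag OR'ed into the mode argument as shown above. */
	long set_interleave_policy(const int *nodes, int count,
				   unsigned long maxnode)
	{
		unsigned long mask = build_nodemask(nodes, count);

		return syscall(SYS_set_mempolicy,
			       MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
			       &mask, maxnode);
	}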

Get [Task] Memory Policy or Related Information::

	long get_mempolicy(int *mode,
			   const unsigned long *nmask, unsigned long maxnode,
			   void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.

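As a sketch of the address-lookup form of the query, the flags below are
from the stable user ABI; the wrapper name is an assumption for the
example, and a production caller would consult errno on failure::

	#include <sys/syscall.h>
	#include <unistd.h>

	#define MPOL_F_NODE  (1 << 0)  /* return node of addr, not mode */
	#define MPOL_F_ADDR  (1 << 1)  /* look up the policy of addr */

	/* Return the node backing `addr`, or -1 on error (e.g. the
	 * page has not been faulted in, or the kernel lacks NUMA). */
	int node_of_address(void *addr)
	{
		int node = -1;

		if (syscall(SYS_get_mempolicy, &node, NULL, 0UL,
			    addr, MPOL_F_NODE | MPOL_F_ADDR) != 0)
			return -1;
		return node;
	}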

Install VMA/Shared Policy for a Range of Task's Address Space::

	long mbind(void *start, unsigned long len, int mode,
		   const unsigned long *nmask, unsigned long maxnode,
		   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks install shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in what cpusets other tasks might
be attaching to the shared region.  Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.