==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004.  This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``),
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes.  Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
	this policy is "hard coded" into the kernel.  It is the policy
	that governs all page allocations that aren't controlled by
	one of the more specific policy scopes discussed below.  When
	the system is "up and running", the system default policy will
	use "local allocation" described below.  However, during boot
	up, the system default policy will be set to interleave
	allocations across all nodes with "sufficient" memory, so as
	not to overload the initial boot node with boot-time
	allocations.

Task/Process Policy
	this is an optional, per-task policy.  When defined for a
	specific task, this policy controls all page allocations made
	by or on behalf of the task that aren't controlled by a more
	specific scope.  If a task does not define a task policy, then
	all page allocations that would have been controlled by the
	task policy "fall back" to the System Default Policy.

	The task policy applies to the entire address space of a task.  Thus,
	it is inheritable, and indeed is inherited, across both fork()
	[clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
	to establish the task policy for a child task exec()'d from an
	executable image that has no awareness of memory policy.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the system call
	that a task may use to set/change its task/process policy.

	In a multi-threaded task, task policies apply only to the thread
	[Linux kernel task] that installs the policy and any threads
	subsequently created by that thread.  Any sibling threads existing
	at the time a new task policy is installed retain their current
	policy.

	A task policy applies only to pages allocated after the policy is
	installed.  Any pages already faulted in by the task when the task
	changes its task policy remain where they were allocated based on
	the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
	A "VMA" or "Virtual Memory Area" refers to a range of a task's
	virtual address space.  A task may define a specific policy for a range
	of its virtual address space.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the mbind() system call used to set a VMA
	policy.

	A VMA policy will govern the allocation of pages that back
	this region of the address space.  Any regions of the task's
	address space that don't have an explicit VMA policy will fall
	back to the task policy, which may itself fall back to the
	System Default Policy.

	VMA policies have a few complicating details:

	* VMA policy applies ONLY to anonymous pages.  These include
	  pages allocated for anonymous segments, such as the task
	  stack and heap, and any regions of the address space
	  mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
	  applied to a file mapping, it will be ignored if the mapping
	  used the MAP_SHARED flag.  If the file mapping used the
	  MAP_PRIVATE flag, the VMA policy will only be applied when
	  an anonymous page is allocated on an attempt to write to the
	  mapping--i.e., at Copy-On-Write.

	* VMA policies are shared between all tasks that share a
	  virtual address space--a.k.a. threads--independent of when
	  the policy is installed; and they are inherited across
	  fork().  However, because VMA policies refer to a specific
	  region of a task's address space, and because the address
	  space is discarded and recreated on exec*(), VMA policies
	  are NOT inheritable across exec().  Thus, only NUMA-aware
	  applications may use VMA policies.

	* A task may install a new VMA policy on a sub-range of a
	  previously mmap()ed region.  When this happens, Linux splits
	  the existing virtual memory area into 2 or 3 VMAs, each with
	  its own policy.

	* By default, VMA policy applies only to pages allocated after
	  the policy is installed.  Any pages already faulted into the
	  VMA range remain where they were allocated based on the
	  policy at the time they were allocated.  However, since
	  2.6.16, Linux supports page migration via the mbind() system
	  call, so that page contents can be moved to match a newly
	  installed policy.

Shared Policy
	Conceptually, shared policies apply to "memory objects" mapped
	shared into one or more tasks' distinct address spaces.  An
	application installs shared policies the same way as VMA
	policies--using the mbind() system call specifying a range of
	virtual addresses that map the shared object.  However, unlike
	VMA policies, which can be considered to be an attribute of a
	range of a task's address space, shared policies apply
	directly to the shared object.  Thus, all tasks that attach to
	the object share the policy, and all pages allocated for the
	shared object, by any task, will obey the shared policy.

	As of 2.6.22, only shared memory segments, created by shmget() or
	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
	policy support was added to Linux, the associated data structures were
	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
	support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
	shmem segments were never "hooked up" to the shared policy support.
	Although hugetlbfs segments now support lazy allocation, their support
	for shared policy has not been completed.

	As mentioned above in the :ref:`VMA policies <vma_policy>` section,
	allocations of page cache pages for regular files mmap()ed
	with MAP_SHARED ignore any VMA policy installed on the virtual
	address range backed by the shared file mapping.  Rather,
	shared page cache pages, including pages backing private
	mappings that have not yet been written by the task, follow
	task policy, if any, else the System Default Policy.

	The shared policy infrastructure supports different policies on subset
	ranges of the shared object.  However, Linux still splits the VMA of
	the task that installs the policy for each range of distinct policy.
	Thus, different tasks that attach to a shared memory segment can have
	different VMA configurations mapping that one shared object.  This
	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
	a shared memory region, when one task has installed a shared policy on
	one or more ranges of the region.

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference-counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following behavioral modes:

Default Mode--MPOL_DEFAULT
	This mode is only used in the memory policy APIs.  Internally,
	MPOL_DEFAULT is converted to the NULL memory policy in all
	policy scopes.  Any existing non-default policy will simply be
	removed when MPOL_DEFAULT is specified.  As a result,
	MPOL_DEFAULT means "fall back to the next most specific policy
	scope."

	For example, a NULL or default task policy will fall back to the
	system default policy.  A NULL or default vma policy will fall
	back to the task policy.

	When specified in one of the memory policy APIs, the Default mode
	does not use the optional set of nodes.

	It is an error for the set of nodes specified for this policy to
	be non-empty.

MPOL_BIND
	This mode specifies that memory must come from the set of
	nodes specified by the policy.  Memory will be allocated from
	the node in the set with sufficient free memory that is
	closest to the node where the allocation takes place.

MPOL_PREFERRED
	This mode specifies that the allocation should be attempted
	from the single node specified in the policy.  If that
	allocation fails, the kernel will search other nodes, in order
	of increasing distance from the preferred node, based on
	information provided by the platform firmware.

	Internally, the Preferred policy uses a single node--the
	preferred_node member of struct mempolicy.  When the internal
	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
	and the policy is interpreted as local allocation.  "Local"
	allocation policy can be viewed as a Preferred policy that
	starts at the node containing the cpu where the allocation
	takes place.

	It is possible for the user to specify that local allocation
	is always preferred by passing an empty nodemask with this
	mode.  If an empty nodemask is passed, the policy cannot use
	the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
	described below.

MPOL_INTERLEAVED
	This mode specifies that page allocations be interleaved, at
	page granularity, across the nodes specified in the policy.
	This mode also behaves slightly differently, based on the
	context where it is used:

	For allocation of anonymous pages and shared memory pages,
	Interleave mode indexes the set of nodes specified by the
	policy using the page offset of the faulting address into the
	segment [VMA] containing the address, modulo the number of
	nodes specified by the policy.  It then attempts to allocate a
	page, starting at the selected node, as if the node had been
	specified by a Preferred policy or had been selected by a
	local allocation.  That is, allocation will follow the
	per-node zonelist.

	For allocation of page cache pages, Interleave mode indexes
	the set of nodes specified by the policy using a node counter
	maintained per task.  This counter wraps around to the lowest
	specified node after it reaches the highest specified node.
	This will tend to spread the pages out over the nodes
	specified by the policy based on the order in which they are
	allocated, rather than based on any page offset into an
	address range or file.  During system boot up, the temporary
	interleaved system default policy works in this mode.

MPOL_PREFERRED_MANY
	This mode specifies that the allocation should preferably be
	satisfied from the nodemask specified in the policy.  If there is
	memory pressure on all nodes in the nodemask, the allocation
	can fall back to all existing NUMA nodes.  This is effectively
	MPOL_PREFERRED applied to a nodemask rather than a single node.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
	This flag specifies that the nodemask passed by
	the user should not be remapped if the task or VMA's set of allowed
	nodes changes after the memory policy has been defined.

	Without this flag, any time a mempolicy is rebound because of a
	change in the set of allowed nodes, the preferred nodemask (Preferred
	Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes.  This may result in nodes
	being used that were previously undesired.

	With this flag, if the user-specified nodes overlap with the
	nodes allowed by the task's cpuset, then the memory policy is
	applied to their intersection.  If the two sets of nodes do not
	overlap, the Default policy is used.

	For example, consider a task that is attached to a cpuset with
	mems 1-3 that sets an Interleave policy over the same set.  If
	the cpuset's mems change to 3-5, the Interleave will now occur
	over nodes 3, 4, and 5.  With this flag, however, since only node
	3 is allowed from the user's nodemask, the "interleave" only
	occurs over that node.  If no nodes from the user's nodemask are
	now allowed, the Default behavior is used.

	MPOL_F_STATIC_NODES cannot be combined with the
	MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

MPOL_F_RELATIVE_NODES
	This flag specifies that the nodemask passed
	by the user will be mapped relative to the task or VMA's
	set of allowed nodes.  The kernel stores the user-passed nodemask,
	and if the set of allowed nodes changes, then that original nodemask
	will be remapped relative to the new set of allowed nodes.

	Without this flag (and without MPOL_F_STATIC_NODES), any time a
	mempolicy is rebound because of a change in the set of allowed
	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes.  That remap may not
	preserve the relative nature of the user's passed nodemask to its
	set of allowed nodes upon successive rebinds: a nodemask of
	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
	allowed nodes is restored to its original state.

	With this flag, the remap is done so that the node numbers from
	the user's passed nodemask are relative to the set of allowed
	nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
	nodemask, the policy will be applied over the first (and in the
	Bind or Interleave case, the third and fifth) nodes in the set of
	allowed nodes.  The nodemask passed by the user represents nodes
	relative to the task or VMA's set of allowed nodes.

	If the user's nodemask includes nodes that are outside the range
	of the new set of allowed nodes (for example, node 5 is set in
	the user's nodemask when the set of allowed nodes is only 0-3),
	then the remap wraps around to the beginning of the nodemask and,
	if not already set, sets the node in the mempolicy nodemask.

	For example, consider a task that is attached to a cpuset with
	mems 2-5 that sets an Interleave policy over the same set with
	MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
	interleave now occurs over nodes 3,5-7.  If the cpuset's mems
	then change to 0,2-3,5, then the interleave occurs over nodes
	0,2-3,5.

	Thanks to the consistent remapping, applications preparing
	nodemasks to specify memory policies using this flag should
	disregard their current, actual cpuset imposed memory placement
	and prepare the nodemask as if they were always located on
	memory nodes 0 to N-1, where N is the number of memory nodes the
	policy is intended to manage.  Let the kernel then remap to the
	set of memory nodes allowed by the task's cpuset, as that may
	change over time.

	MPOL_F_RELATIVE_NODES cannot be combined with the
	MPOL_F_STATIC_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively.  mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy.  When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes.  "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation.  This is considered a "hot
   path".  Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query.  The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies.  Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read.  Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration.  One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.  To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure.  This requires that we drop this extra
   reference when we're finished "using" the policy.  We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.  For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path.  This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes.  This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down.  However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 4 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

	long set_mempolicy(int mode, const unsigned long *nmask,
					unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.


43562306a36Sopenharmony_ciGet [Task] Memory Policy or Related Information::
43662306a36Sopenharmony_ci
43762306a36Sopenharmony_ci	long get_mempolicy(int *mode,
43862306a36Sopenharmony_ci			   const unsigned long *nmask, unsigned long maxnode,
43962306a36Sopenharmony_ci			   void *addr, int flags);
44062306a36Sopenharmony_ci
44162306a36Sopenharmony_ciQueries the "task/process memory policy" of the calling task, or the
44262306a36Sopenharmony_cipolicy or location of a specified virtual address, depending on the
44362306a36Sopenharmony_ci'flags' argument.
44462306a36Sopenharmony_ci
44562306a36Sopenharmony_ciSee the get_mempolicy(2) man page for more details


Install VMA/Shared Policy for a Range of Task's Address Space::

	long mbind(void *start, unsigned long len, int mode,
		   const unsigned long *nmask, unsigned long maxnode,
		   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

Set home node for a Range of Task's Address Space::

	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
					 unsigned long home_node,
					 unsigned long flags);

sys_set_mempolicy_home_node() sets the home node for the VMA policies present
in the specified range of the task's address space.  The system call updates
the home node only for existing mempolicy ranges; other address ranges are
ignored.  The home node is the NUMA node from which page allocations for the
range will preferentially be satisfied.  Specifying a home node overrides the
default behavior of allocating memory close to the CPU on which the task is
executing.


Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks installs a shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information, and requires that one know which cpusets other tasks might
attach to the shared region.  Furthermore, if the cpusets' allowed memory sets
are disjoint, "local" allocation is the only valid policy.