.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives:  the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies each of which may contain 0
or more CPUs, local memory, and/or IO buses.  For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell.  The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnects.  Both of
these types of interconnects can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems.  With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory.  For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells.  NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting.  Rather, this architecture is a means to provide scalable
memory bandwidth.  However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes".  Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures.  As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses.  And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory.  Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures--again, x86 is an example--Linux supports
the emulation of additional nodes.  For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes.  Each emulated node will manage a fraction of the underlying cells'
physical memory.  NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
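
On x86, for example, NUMA emulation is requested on the kernel command line;
a boot parameter such as the following (exact syntax may vary by kernel
version) splits the system into four emulated nodes::

    numa=fake=4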

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access.  In addition, for each memory zone
[one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], Linux constructs
an ordered "zonelist".  A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request.  This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node.  This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
a default node-ordered zonelist.  This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.
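
As an illustration, on a hypothetical two-node x86_64 system where node 0
holds the DMA and DMA32 zones, a node-ordered zonelist for node 0 might be
visited in this order::

    Node 0, zone NORMAL
    Node 0, zone DMA32
    Node 0, zone DMA
    Node 1, zone NORMAL

so an allocation falls back through all of node 0's zones before spilling
over to node 1.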

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned.  Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates.  This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.
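
In kernel code, this default policy is what a plain allocation request gets;
a minimal sketch (error handling elided)::

    #include <linux/gfp.h>

    /*
     * No node is specified, so the page is allocated by walking the
     * zonelist of the node local to the calling CPU, falling back to
     * more distant zones/nodes only if local memory is exhausted.
     */
    struct page *page = alloc_pages(GFP_KERNEL, 0);  /* one page, order 0 */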

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory.  The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains.  However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, ending up
remote from their initial node and its kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2).  Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].
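
A minimal userspace sketch combining both techniques, assuming the libnuma
library (numa(3)) is installed and linking with -lnuma; the CPU and node
numbers are purely illustrative::

    #define _GNU_SOURCE
    #include <numa.h>       /* numa_available(), numa_alloc_onnode() */
    #include <sched.h>      /* CPU_ZERO, CPU_SET, sched_setaffinity() */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            cpu_set_t set;

            if (numa_available() < 0) {
                    fprintf(stderr, "no NUMA support on this system\n");
                    return 1;
            }

            /*
             * Pin this task to CPU 0 so the scheduler cannot migrate it
             * away from the memory allocated below.  (That CPU 0 belongs
             * to node 0 is an assumption made for illustration.)
             */
            CPU_ZERO(&set);
            CPU_SET(0, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return 1;
            }

            /*
             * Apply a memory policy: allocate 4 MiB from node 0
             * specifically, overriding default local allocation.
             */
            size_t sz = 4UL << 20;
            void *buf = numa_alloc_onnode(sz, 0);
            if (!buf) {
                    perror("numa_alloc_onnode");
                    return 1;
            }
            memset(buf, 0, sz);     /* touch pages so they are placed */

            numa_free(buf, sz);
            return 0;
    }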

System administrators can use control groups and CPUsets to restrict the CPUs
and the nodes' memories that a non-privileged user can specify in the
scheduling or NUMA commands and functions.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists.  This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself.  Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default, local allocations will succeed with the kernel supplying the
closest available memory.  This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want, or cannot tolerate, this allocation
fallback behavior.  Rather, they want to be sure they get memory from the
specified node or get notified that the node has no free memory.  This is
usually the case when a subsystem allocates per-CPU memory resources, for
example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned.  When such an allocation fails, the requesting subsystem
may revert to its own fallback path.  The slab kernel memory allocator is an
example of this.  Or, the subsystem may choose to disable or not to enable
itself on allocation failure.  The kernel profiling subsystem is an example of
this.
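
A minimal kernel-code sketch of this model, using kmalloc_node() with the
__GFP_THISNODE flag; the fallback handling shown is illustrative, not the
slab allocator's actual internal logic::

    #include <linux/slab.h>
    #include <linux/topology.h>     /* numa_node_id() */

    int nid = numa_node_id();       /* node of the current CPU */

    /*
     * __GFP_THISNODE forbids zonelist fallback: the allocation either
     * succeeds on node 'nid' or fails, so the caller can run its own
     * fallback path (or disable itself) instead.
     */
    void *buf = kmalloc_node(1024, GFP_KERNEL | __GFP_THISNODE, nid);
    if (!buf)
            buf = kmalloc(1024, GFP_KERNEL);  /* subsystem's own fallback */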

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory.  To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU.  Again, this is the same node from which default, local page
allocations will be attempted.
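
A variation of the previous sketch that stays safe on architectures with
memoryless nodes::

    #include <linux/slab.h>
    #include <linux/topology.h>     /* numa_mem_id() */

    /*
     * numa_mem_id() returns the "local memory node"--the current node
     * itself when it has memory, otherwise the nearest node that does.
     * Using it instead of numa_node_id() avoids guaranteed-to-fail
     * __GFP_THISNODE allocations on memoryless nodes.
     */
    int nid = numa_mem_id();
    void *buf = kmalloc_node(1024, GFP_KERNEL | __GFP_THISNODE, nid);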