Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives:  the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies each of which may contain 0
or more CPUs, local memory, and/or IO buses.  For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.
Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell.  The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnect.  Both of
these interconnect types can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems.  With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory.  For example, CPUs accessing memory on
their own cell will see faster access times and higher bandwidth than CPUs
accessing memory on other, remote cells.  NUMA platforms can have cells at
multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting.  Rather, this architecture is a means to provide scalable
memory bandwidth.  However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes".  Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures.  As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses.  And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory.  Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures--again, x86 is an example--Linux supports
the emulation of additional nodes.  For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes.  Each emulated node will manage a fraction of the underlying cells'
physical memory.  NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
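
For example, on x86 the emulated nodes can be requested at boot with the
numa=fake parameter.  The exact syntax varies by architecture and kernel
version, so treat the following line as illustrative only; it asks the kernel
to carve memory into four emulated nodes::

    numa=fake=4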

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access.  In addition, for each memory zone
[one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], Linux constructs an
ordered "zonelist".  A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request.  This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node.  This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
a default node-ordered zonelist.  This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.
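
As a brief sketch of fallback from kernel context (illustrative only; nid
stands for some target node id), a caller that requests pages on a specific
node without forbidding fallback will have the request satisfied from that
node's zonelist::

    #include <linux/gfp.h>

    /* Ask for one page on node nid.  If nid's zones cannot satisfy the
     * request, the allocator walks nid's zonelist and "falls back" to
     * zones on other nodes, nearest first.
     */
    struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);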

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned.  Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates.  This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.
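
A minimal sketch of local allocation (assuming the current slab and topology
interfaces; size is a placeholder): an allocation that names no node starts
from the local node's zonelist, so the two calls below have the same starting
point::

    #include <linux/slab.h>
    #include <linux/topology.h>

    void *a = kmalloc(size, GFP_KERNEL);   /* implicitly: local node first */
    void *b = kmalloc_node(size, GFP_KERNEL, numa_node_id());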

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory.  The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains.  However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, ending up
remote from their initial node and from their kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2).  Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy
[see Documentation/admin-guide/mm/numa_memory_policy.rst].
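
For example, numactl --cpunodebind=0 --membind=0 <command> runs a command
restricted to the CPUs and memory of node 0.  A userspace sketch of the
programmatic equivalent follows; set_mempolicy(2) is declared in numaif.h
from libnuma, and error handling is abbreviated::

    #define _GNU_SOURCE
    #include <sched.h>     /* sched_setaffinity(), CPU_* macros */
    #include <numaif.h>    /* set_mempolicy(), MPOL_BIND; link with -lnuma */

    static int bind_to_node0(void)
    {
            cpu_set_t cpus;
            unsigned long nodes = 1UL;      /* bit 0 => node 0 */

            CPU_ZERO(&cpus);
            CPU_SET(0, &cpus);              /* run only on CPU 0 */
            if (sched_setaffinity(0, sizeof(cpus), &cpus))
                    return -1;

            /* bind this task's future allocations to node 0 */
            return (int)set_mempolicy(MPOL_BIND, &nodes, sizeof(nodes) * 8);
    }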

System administrators can use control groups and CPUsets to restrict the CPUs
and memory nodes that a non-privileged user may specify in the scheduling and
NUMA commands and functions described above.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists.  This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself.  Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default local allocations will succeed, with the kernel supplying the
closest available memory.  This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want, or cannot tolerate, this allocation
fallback behavior.  Rather, they want to be sure they get memory from the
specified node or to be notified that the node has no free memory.  This is
the case, for example, when a subsystem allocates per-CPU memory resources.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned.  When such an allocation fails, the requesting subsystem
may revert to its own fallback path.  The slab kernel memory allocator is an
example of this.  Or, the subsystem may choose to disable or not to enable
itself on allocation failure.  The kernel profiling subsystem is an example of
this.
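
A hedged sketch of this pattern (size is a placeholder; __GFP_THISNODE
forbids the zonelist fallback described earlier)::

    #include <linux/slab.h>
    #include <linux/topology.h>

    int nid = numa_node_id();   /* node of the executing CPU */
    void *buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
    if (!buf) {
            /* nid has no free memory: take a subsystem-private fallback
             * path or disable the feature rather than silently using
             * remote memory.
             */
    }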

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory.  To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU.  Again, this is the same node from which default, local page
allocations will be attempted.
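
Sketching the memoryless-node-safe variant of the same pattern::

    /* numa_mem_id() equals numa_node_id() on nodes with memory; on a
     * memoryless node it returns the nearest node that has memory.
     */
    int nid = numa_mem_id();
    void *buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);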