Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnects. Both of
these types of interconnects can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems.
With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from
the cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory
references [cache misses] to be to "local" memory--memory on the same cell,
if any--or to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures.
As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates
with a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE]
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that
allocations fall back to the same zone type on a different node, or to a
different zone type on the same node. This is an important consideration
because some zones, such as DMA or DMA32, represent relatively scarce
resources. Linux chooses a default node-ordered zonelist. This means it
tries to fall back to other zones from the same node before using remote
nodes, which are ordered by NUMA distance.

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate
zonelist for the node where the request originates.
This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system
interconnect--as long as the task on whose behalf the kernel allocated some
memory does not later migrate away from that memory. The Linux scheduler is
aware of the NUMA topology of the platform--embodied in the "scheduling
domains" data structures [see Documentation/scheduler/sched-domains.rst]--and
the scheduler attempts to minimize task migration to distant scheduling
domains. However, the scheduler does not take a task's NUMA footprint into
account directly. Thus, under sufficient imbalance, tasks can migrate between
nodes, remote from their initial node and kernel data structures.

System administrators and application designers can restrict a task's
migration to improve NUMA locality using various CPU affinity command line
interfaces, such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy
[see Documentation/admin-guide/mm/numa_memory_policy.rst].
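These affinity interfaces can be exercised from user space. As a minimal
sketch (assuming a Linux system where CPU 0 is online and permitted by the
task's cpuset), Python exposes sched_setaffinity(2) directly:

```python
import os

# Pin the calling process to CPU 0, as sched_setaffinity(2) or
# `taskset -c 0` would. On a NUMA system this keeps the task on a
# single node, so default "local" allocations stay local to it.
os.sched_setaffinity(0, {0})              # pid 0 == the calling process

# Read back the new affinity mask.
print(sorted(os.sched_getaffinity(0)))    # [0]
```

Note that taskset(1) and sched_setaffinity(2) only restrict where the task
runs; numactl(1) can additionally bind memory placement, e.g.
``numactl --cpunodebind=0 --membind=0 <cmd>``.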
System administrators can restrict the CPUs and nodes' memories that a
non-privileged user can specify in the scheduling or NUMA commands and
functions using control groups and CPUsets
[see Documentation/admin-guide/cgroup-v1/cpusets.rst].

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default, local allocations will succeed with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation
fallback behavior. Rather they want to be sure they get memory from the
specified node or get notified that the node has no free memory. This is
usually the case when a subsystem allocates per-CPU memory resources, for
example.
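The difference between the default fallback behavior and what such
subsystems need can be sketched with a toy user-space model (purely
illustrative: the node layout, free-page counts, and function names below
are invented for this sketch, not kernel API):

```python
# Toy model: two nodes, with node 0's zonelist ordered local-first.
free_pages = {0: 0, 1: 512}            # node 0 has no free memory left
zonelist = {0: [0, 1], 1: [1, 0]}      # per-node fallback order

def alloc_with_fallback(node):
    """Default behavior: walk the node's zonelist and take the first
    node that can satisfy the request ("overflow"/"fallback")."""
    for candidate in zonelist[node]:
        if free_pages[candidate] > 0:
            free_pages[candidate] -= 1
            return candidate
    return None                        # out of memory everywhere

def alloc_exact(node):
    """What a per-CPU subsystem wants instead: memory from exactly
    the specified node, or a failure it can react to itself."""
    if free_pages[node] > 0:
        free_pages[node] -= 1
        return node
    return None                        # caller chooses its own fallback

print(alloc_with_fallback(0))          # 1 (overflowed to the next node)
print(alloc_exact(0))                  # None (exact request fails instead)
```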
A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead,
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
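The mapping that numa_mem_id()/cpu_to_mem() provide can be sketched in
user-space form (a hypothetical three-node topology and distance table; the
kernel derives the real mapping when it builds the zonelists):

```python
# Hypothetical topology: node 2 has CPUs but no memory attached.
node_has_memory = {0: True, 1: True, 2: False}
numa_distance = {(0, 1): 20, (0, 2): 10, (1, 2): 20}  # symmetric table

def distance(a, b):
    return 0 if a == b else numa_distance[tuple(sorted((a, b)))]

def local_memory_node(node):
    """The node that default "local" allocations actually come from:
    the node itself if it has memory, otherwise the nearest node that
    does (the per-CPU answer numa_mem_id()/cpu_to_mem() report)."""
    if node_has_memory[node]:
        return node
    with_mem = [n for n, has in node_has_memory.items() if has]
    return min(with_mem, key=lambda n: distance(node, n))

print(local_memory_node(0))   # 0: a node with memory is its own answer
print(local_memory_node(2))   # 0: nearest memory-full node to node 2
```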