.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnects. Both of
these types of interconnects can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures. As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
a default Node ordered zonelist. This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates. This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.
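
To make "local allocation" concrete, here is a minimal, illustrative
kernel-code sketch (the function name demo_local_alloc() is invented for
this document): it makes an ordinary GFP_KERNEL page allocation, which is
attempted on the calling CPU's node first, and reports which node the page
actually landed on::

  /*
   * Illustrative only: allocate a page with a default, "local" request
   * and report which node actually supplied it. On an unloaded system
   * this is normally the calling CPU's own node; under memory pressure
   * the zonelist fallback may pick a more remote one.
   */
  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <linux/topology.h>

  static int demo_local_alloc(void)
  {
          struct page *page = alloc_pages(GFP_KERNEL, 0); /* order-0 */

          if (!page)
                  return -ENOMEM;

          pr_info("page on node %d, current CPU on node %d\n",
                  page_to_nid(page), numa_node_id());

          __free_pages(page, 0);
          return 0;
  }
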
Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].
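
As a hedged illustration of the program-interface route, the userspace sketch
below pins the calling task to an assumed set of CPUs (0-3, presumed here to
share one node--a real program would discover the topology first, e.g. with
numactl --hardware) using sched_setaffinity(2)::

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t mask;

          CPU_ZERO(&mask);
          for (int cpu = 0; cpu < 4; cpu++)  /* assumption: CPUs 0-3 share a node */
                  CPU_SET(cpu, &mask);

          if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* 0 == this task */
                  perror("sched_setaffinity");
                  return 1;
          }

          /*
           * The scheduler can no longer migrate this task off CPUs 0-3,
           * so its default, local allocations stay on their node(s).
           */
          return 0;
  }

The same restriction can be applied from the shell with taskset(1) or
numactl(1); numactl can additionally bind the task's memory policy.
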
System administrators can restrict the CPUs and nodes' memories that a
non-privileged user can specify in the scheduling or NUMA commands and
functions using control groups and CPUsets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default, local allocations will succeed with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback
behavior. Rather, they want to be sure they get memory from the specified node
or get notified that the node has no free memory. This is usually the case when
a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.
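
A minimal sketch of this model follows (the helper name demo_pernode_alloc()
is invented): __GFP_THISNODE suppresses the zonelist fallback, so the request
either succeeds on the chosen node or fails outright, after which the caller
runs its own fallback path::

  #include <linux/slab.h>
  #include <linux/topology.h>

  static void *demo_pernode_alloc(size_t size)
  {
          int nid = numa_node_id();  /* node of the current CPU */

          /*
           * __GFP_THISNODE: no zonelist fallback--fail rather than
           * silently hand back memory from some other node.
           */
          void *buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);

          if (!buf)
                  buf = kmalloc(size, GFP_KERNEL);  /* subsystem's own fallback */
          return buf;
  }
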
If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() functions to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
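
For example, a subsystem setting up per-CPU buffers might do something like
the following hedged sketch (demo_buffers and demo_setup are invented names);
because cpu_to_mem() already names the nearest node with memory, the same
code works whether or not the platform has memoryless nodes::

  #include <linux/cpumask.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  static void *demo_buffers[NR_CPUS];

  static int demo_setup(size_t size)
  {
          int cpu;

          for_each_online_cpu(cpu) {
                  /*
                   * cpu_to_mem(cpu) equals cpu_to_node(cpu) on most
                   * systems, but names the nearest node with memory
                   * when 'cpu' sits on a memoryless node.
                   */
                  demo_buffers[cpu] = kmalloc_node(size, GFP_KERNEL,
                                                   cpu_to_mem(cpu));
                  if (!demo_buffers[cpu])
                          return -ENOMEM;
          }
          return 0;
  }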