.. SPDX-License-Identifier: GPL-2.0

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. This range may,
however, contain small holes that are not accessible to the CPU.
There may also be several contiguous ranges at completely distinct
addresses. Finally, on NUMA systems different memory banks are
attached to different CPUs.

Linux abstracts this diversity using one of two memory models:
FLATMEM and SPARSEMEM. Each architecture defines which memory models
it supports, what the default memory model is, and whether that
default can be overridden manually.

All the memory models track the status of physical page frames using
`struct page` objects arranged in one or more arrays.

Regardless of the selected memory model, there is a one-to-one
mapping between a physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that convert a PFN to a `struct page` and vice versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call :c:func:`free_area_init`. The array is not usable, however, until
the call to :c:func:`memblock_free_all` that hands all the memory to the
page allocator.

An architecture may free parts of the `mem_map` array that do not cover
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

`ARCH_PFN_OFFSET` defines the first page frame number for systems
whose physical memory starts at an address other than 0.

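The FLATMEM conversion described above can be sketched in user-space C.
This is a minimal illustration, not the kernel's implementation: the
one-field `struct page`, the `NR_PAGES` array size and the value chosen
for `ARCH_PFN_OFFSET` are assumptions made only for the example.

.. code-block:: c

   /* Hypothetical stand-in for the kernel's struct page. */
   struct page {
           unsigned long flags;
   };

   /* Illustrative value: first PFN when memory does not start at 0. */
   #define ARCH_PFN_OFFSET 0x80000UL
   /* Illustrative size of the flat memory map. */
   #define NR_PAGES 1024UL

   /* One flat array with an entry for every page frame. */
   static struct page mem_map[NR_PAGES];

   /* PFN - ARCH_PFN_OFFSET is the index into mem_map. */
   static inline struct page *pfn_to_page(unsigned long pfn)
   {
           return mem_map + (pfn - ARCH_PFN_OFFSET);
   }

   /* The inverse: array index plus ARCH_PFN_OFFSET gives the PFN. */
   static inline unsigned long page_to_pfn(const struct page *page)
   {
           return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
   }

Both directions are plain pointer arithmetic on a single array, which is
why holes in physical memory still consume `mem_map` entries in this
model.
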
SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented by `struct mem_section`, which
contains `section_mem_map` that is, logically, a pointer to an array
of `struct page` objects. However, it is stored together with some
additional encoded information that aids section management. The
section size and the maximal number of sections are specified using
the `SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS` constants defined by
each architecture that supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is
the actual width of a physical address that an architecture supports,
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

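As a worked example of this formula, the sketch below assumes 4 KiB
pages, a `SECTION_SIZE_BITS` of 27 and a `MAX_PHYSMEM_BITS` of 46.
These numbers are illustrative only, since each architecture defines
its own values. The `PAGE_SHIFT`, `SECTIONS_SHIFT`, `PFN_SECTION_SHIFT`
and `PAGES_PER_SECTION` names are introduced here for the arithmetic.

.. code-block:: c

   /* Illustrative values; every architecture chooses its own. */
   #define PAGE_SHIFT        12  /* 2^12 bytes = 4 KiB pages */
   #define SECTION_SIZE_BITS 27  /* 2^27 bytes = 128 MiB per section */
   #define MAX_PHYSMEM_BITS  46  /* 2^46 bytes = 64 TiB of physical address space */

   /* Number of bits needed to number all sections: 46 - 27 = 19. */
   #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
   /* 2^19 = 524288 sections. */
   #define NR_MEM_SECTIONS   (1UL << SECTIONS_SHIFT)

   /* PFN bits that select a page inside a section: 27 - 12 = 15. */
   #define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
   /* 2^15 = 32768 pages per section. */
   #define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)

With these values a larger `SECTION_SIZE_BITS` would give fewer, bigger
sections, while a smaller one would give finer granularity at the cost
of more `mem_section` objects.
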
The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains `PAGE_SIZE` worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

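The dynamically allocated layout can be sketched in user-space C as a
table of row pointers. This is only a sketch under stated assumptions:
the one-field `struct mem_section`, the `lookup_section()` helper, the
`SECTIONS_PER_ROOT` and `NR_SECTION_ROOTS` names, the 4 KiB `PAGE_SIZE`
and the choice to allocate rows lazily with `calloc()` are all made up
for the illustration.

.. code-block:: c

   #include <stdlib.h>

   /* Hypothetical, reduced stand-in for struct mem_section. */
   struct mem_section {
           unsigned long section_mem_map;
   };

   #define PAGE_SIZE         4096UL       /* illustrative */
   #define NR_MEM_SECTIONS   (1UL << 19)  /* illustrative */
   /* Each row holds PAGE_SIZE worth of mem_section objects. */
   #define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof(struct mem_section))
   /* Enough rows to fit all the sections, rounded up. */
   #define NR_SECTION_ROOTS \
           ((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

   /* Table of row pointers; rows are allocated on first use in this sketch. */
   static struct mem_section **mem_sections;

   /* Return the mem_section for section number nr, or NULL if absent. */
   static struct mem_section *lookup_section(unsigned long nr, int create)
   {
           unsigned long root = nr / SECTIONS_PER_ROOT;

           if (nr >= NR_MEM_SECTIONS)
                   return NULL;
           if (!mem_sections) {
                   if (!create)
                           return NULL;
                   mem_sections = calloc(NR_SECTION_ROOTS, sizeof(*mem_sections));
                   if (!mem_sections)
                           return NULL;
           }
           if (!mem_sections[root]) {
                   if (!create)
                           return NULL;
                   mem_sections[root] = calloc(SECTIONS_PER_ROOT,
                                               sizeof(struct mem_section));
                   if (!mem_sections[root])
                           return NULL;
           }
           return &mem_sections[root][nr % SECTIONS_PER_ROOT];
   }

The point of the two-level layout in this sketch is that only rows
covering sections that are actually used consume memory, whereas a
static array has to be sized for the maximal number of sections.
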
The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse vmemmap".
The selection is made at build time and is determined by the value of
`CONFIG_SPARSEMEM_VMEMMAP`.

Classic sparse encodes the section number of a page in `page->flags`
and uses the high bits of a PFN to access the section that maps that
page frame. Inside a section, the PFN is the index into the array of
pages.

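A user-space sketch of the classic sparse lookup follows. It is an
illustration under simplifying assumptions, not the kernel's code: the
tiny section geometry, the flat `sections` array, the `section_init()`
helper and the bit position used for the section number in `flags` are
all hypothetical. The sketch indexes a section's array with the low
bits of the PFN, which is one way to realize the per-section indexing
described above.

.. code-block:: c

   #include <limits.h>

   /* Hypothetical, reduced stand-ins for the kernel structures. */
   struct page {
           unsigned long flags;
   };

   struct mem_section {
           struct page *map;  /* array of pages backing this section */
   };

   /* Tiny illustrative geometry: 8 sections of 16 pages each. */
   #define PFN_SECTION_SHIFT 4
   #define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
   #define SECTIONS_WIDTH    3
   #define NR_MEM_SECTIONS   (1UL << SECTIONS_WIDTH)
   /* Section number lives in the top SECTIONS_WIDTH bits of flags. */
   #define SECTIONS_PGSHIFT  (sizeof(unsigned long) * CHAR_BIT - SECTIONS_WIDTH)

   static struct mem_section sections[NR_MEM_SECTIONS];

   /* Attach a page array to a section and tag each page with its section. */
   static void section_init(unsigned long nr, struct page *map)
   {
           unsigned long i;

           sections[nr].map = map;
           for (i = 0; i < PAGES_PER_SECTION; i++)
                   map[i].flags = nr << SECTIONS_PGSHIFT;
   }

   /* High PFN bits select the section, low bits index into its array. */
   static struct page *pfn_to_page(unsigned long pfn)
   {
           struct mem_section *sec = &sections[pfn >> PFN_SECTION_SHIFT];

           return sec->map + (pfn & (PAGES_PER_SECTION - 1));
   }

   /* The section number stored in flags recovers the high PFN bits. */
   static unsigned long page_to_pfn(const struct page *page)
   {
           unsigned long nr = page->flags >> SECTIONS_PGSHIFT;

           return (nr << PFN_SECTION_SHIFT) +
                  (unsigned long)(page - sections[nr].map);
   }

Note that `page_to_pfn()` needs the section number from `flags` because
the page arrays of different sections are unrelated allocations, so the
address of a `struct page` alone does not reveal its PFN.
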
Sparse vmemmap uses a virtually mapped memory map to optimize
:c:func:`pfn_to_page` and :c:func:`page_to_pfn` operations. There is a
global `struct page *vmemmap` pointer that points to a virtually
contiguous array of `struct page` objects. A PFN is an index into that
array, and the offset of the `struct page` from `vmemmap` is the PFN of
that page.

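In user-space terms the vmemmap conversion reduces to the sketch below.
The statically allocated `vmemmap_backing` array and `DEMO_MAX_PFN` are
assumptions that stand in for the reserved virtual address range; in
this toy version the whole array is backed by memory, which is not a
requirement of the model.

.. code-block:: c

   /* Hypothetical stand-in for the kernel's struct page. */
   struct page {
           unsigned long flags;
   };

   /* Illustrative stand-in for the reserved virtual range. */
   #define DEMO_MAX_PFN 4096UL
   static struct page vmemmap_backing[DEMO_MAX_PFN];

   /* Global pointer to the virtually contiguous memory map. */
   static struct page *vmemmap = vmemmap_backing;

   /* The PFN is directly the index into the vmemmap array. */
   static inline struct page *pfn_to_page(unsigned long pfn)
   {
           return vmemmap + pfn;
   }

   /* The offset of the page from vmemmap is its PFN. */
   static inline unsigned long page_to_pfn(const struct page *page)
   {
           return (unsigned long)(page - vmemmap);
   }

Compared with the classic sparse lookup, neither direction has to
consult a section or `page->flags`, which is the optimization this
variant is after.
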
To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented by `struct vmem_altmap`, which
is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of PFNs. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users need a
smaller granularity for populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, its memory ranges are
never exposed through the sysfs memory hotplug API on memory block
boundaries. The implementation relies on this lack of user-API
constraint to allow sub-section sized memory ranges to be specified to
:c:func:`arch_add_memory`, the top half of memory hotplug. Sub-section
support allows for 2MB as the cross-arch common alignment granularity
for :c:func:`devm_memremap_pages`.

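The 2MB alignment granularity can be illustrated with a small check.
The `range_is_subsection_aligned()` helper and the `SUBSECTION_SHIFT`
and `SUBSECTION_SIZE` names below are hypothetical and serve only to
show the arithmetic; they are not presented as the checks performed by
:c:func:`devm_memremap_pages`.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* 2^21 bytes = 2MB, the alignment granularity named in the text. */
   #define SUBSECTION_SHIFT 21
   #define SUBSECTION_SIZE  (UINT64_C(1) << SUBSECTION_SHIFT)

   /*
    * Hypothetical helper: true when a non-empty physical range both
    * starts on a 2MB boundary and covers a whole number of 2MB units.
    */
   static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
   {
           return size != 0 && ((start | size) & (SUBSECTION_SIZE - 1)) == 0;
   }

For instance, a 2MB range starting at 4GB passes this check, while the
same range shifted by one 4 KiB page does not.
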
The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.