.. SPDX-License-Identifier: GPL-2.0

==============
5-level paging
==============

Overview
========
Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
space and 64 TiB of physical address space. We are already bumping into
this limit: some vendors offer servers with 64 TiB of memory today.

To overcome the limitation, upcoming hardware will introduce support for
5-level paging. It is a straightforward extension of the current page
table structure, adding one more layer of translation.

It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.
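
The limits quoted above follow from the translation geometry. As a minimal
sketch, assuming 4 KiB pages (a 12-bit page offset) and 9 bits of index per
page table level, and taking the physical widths of 46 and 52 bits as implied
by the 64 TiB and 4 PiB figures, the sizes can be recomputed like this. The
helper names are illustrative and do not come from the kernel.

```c
#include <stdint.h>

/* Assumed geometry: 4 KiB pages and 512-entry tables at every level. */
#define PAGE_OFFSET_BITS 12u
#define INDEX_BITS_PER_LEVEL 9u

/* Virtual address width for a given number of page table levels. */
static unsigned va_bits(unsigned levels)
{
    return PAGE_OFFSET_BITS + INDEX_BITS_PER_LEVEL * levels;
}

/* Size in bytes of an address space that is `bits` wide (bits < 64). */
static uint64_t addr_space_bytes(unsigned bits)
{
    return UINT64_C(1) << bits;
}

/* Convert a byte count to whole TiB (2^40) or PiB (2^50). */
static uint64_t bytes_to_tib(uint64_t bytes) { return bytes >> 40; }
static uint64_t bytes_to_pib(uint64_t bytes) { return bytes >> 50; }
```

With these helpers, ``va_bits(4)`` is 48 and ``va_bits(5)`` is 57, which gives
256 TiB and 128 PiB of virtual address space respectively.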

QEMU 2.9 and later support 5-level paging.

Virtual memory layout for 5-level paging is described in
Documentation/arch/x86/x86_64/mm.rst


Enabling 5-level paging
=======================
CONFIG_X86_5LEVEL=y enables the feature.

A kernel built with CONFIG_X86_5LEVEL=y is still able to boot on 4-level
hardware. In this case the additional page table level -- p4d -- is folded
at runtime.
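
Whether the running system sees 5-level capable hardware can be checked from
userspace. The sketch below is one possible approach and is not part of this
document's interface description: it assumes the CPU feature is reported as
``la57`` on the ``flags`` line of /proc/cpuinfo, and the function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated word. */
static bool has_cpu_flag(const char *flags, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = flags;

    if (n == 0)
        return false;
    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

        if (starts && ends)
            return true;
        p += n;
    }
    return false;
}

/*
 * Scan /proc/cpuinfo for a "flags" line that carries la57 (assumed name).
 * Returns false if the file cannot be read.
 */
static bool cpu_reports_la57(void)
{
    char line[8192];
    bool found = false;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (!f)
        return false;
    while (!found && fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0) {
            const char *colon = strchr(line, ':');

            if (colon)
                found = has_cpu_flag(colon + 1, "la57");
        }
    }
    fclose(f);
    return found;
}
```

A false result does not distinguish between hardware without the feature and
a system where the information is unavailable, so treat it as a hint only.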

User-space and large virtual address space
==========================================
On x86, 5-level paging enables a 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. This encoding collides with valid pointers under 5-level
paging and leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47-bit.

If the hint address is set above 47-bit, but MAP_FIXED is not specified, we
try to look for an unmapped area at the specified address. If it's already
occupied, we look for an unmapped area in the *full* address space, rather
than in the 47-bit window.
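
The hint mechanism described above can be sketched as follows. This is an
illustration under assumptions, not a reference implementation: the hint
value ``1 << 48`` is an arbitrary address above the 47-bit window, the helper
names are hypothetical, and the code targets x86-64 Linux.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Arbitrary pick: any address above the 47-bit window acts as an opt-in. */
#define HIGH_HINT ((void *)(uintptr_t)(UINT64_C(1) << 48))

/*
 * Map `len` bytes of anonymous memory, hinting above 47-bit and without
 * MAP_FIXED, so the kernel is free to place the mapping elsewhere.
 * Returns MAP_FAILED on error, like mmap().
 */
static void *map_with_high_hint(size_t len)
{
    return mmap(HIGH_HINT, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* True if the address lies above the default 47-bit window. */
static bool is_above_47bit(const void *p)
{
    return (uintptr_t)p >= (UINT64_C(1) << 47);
}
```

A caller would check the result against ``MAP_FAILED`` first and may then use
``is_above_47bit()`` to see where the mapping actually landed; the returned
address is not guaranteed to equal the hint.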

A high hint address only affects the allocation in question, not any
future mmap()s.

Specifying a high hint address on an older kernel or on a machine without
5-level paging support is safe. The hint will be ignored and the kernel will
fall back to allocation from the 47-bit address space.

This approach makes it easy to make an application's memory allocator aware
of the large address space without manually tracking allocated virtual
address space.
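
Because a high hint affects only the call that carries it, an allocator can
decide per mapping whether to opt in. The following is a minimal sketch of
that idea; ``struct arena``, its ``use_full_va`` flag and both functions are
hypothetical names invented for this example, and the hint value is again an
arbitrary address above 47-bit.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical arena descriptor; only the opt-in flag matters here. */
struct arena {
    bool use_full_va;   /* true: mappings may land above 47-bit */
};

/* Back an arena with anonymous memory, choosing the hint per call. */
static void *arena_map(const struct arena *a, size_t len)
{
    /* NULL keeps the default 47-bit window; a high hint opts in. */
    void *hint = a->use_full_va
        ? (void *)(uintptr_t)(UINT64_C(1) << 48)
        : NULL;

    return mmap(hint, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Release memory obtained from arena_map(). */
static int arena_unmap(void *p, size_t len)
{
    return munmap(p, len);
}
```

An arena that stores tagged pointers would leave ``use_full_va`` false and
keep receiving addresses inside the 47-bit window, while other arenas in the
same process can opt in without any extra bookkeeping.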

One important case we need to handle here is interaction with MPX.
MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled if we already have a VMA above
the boundary, and forbid creating such VMAs once MPX is enabled.