.. SPDX-License-Identifier: GPL-2.0

==============
5-level paging
==============

Overview
========
Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
space and 64 TiB of physical address space. We are already bumping into
this limit: some vendors offer servers with 64 TiB of memory today.

To overcome the limitation, upcoming hardware will introduce support for
5-level paging. It is a straightforward extension of the current page
table structure, adding one more layer of translation.

It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.

QEMU 2.9 and later support 5-level paging.

Virtual memory layout for 5-level paging is described in
Documentation/x86/x86_64/mm.rst


Enabling 5-level paging
=======================
CONFIG_X86_5LEVEL=y enables the feature.

A kernel built with CONFIG_X86_5LEVEL=y is still able to boot on 4-level
hardware. In this case the additional page table level -- p4d -- will be
folded at runtime.


User-space and large virtual address space
==========================================
On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. This collides with valid pointers under 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47-bit.

If the hint address is set above 47-bit, but MAP_FIXED is not specified, we
try to look for an unmapped area at the specified address. If it's already
occupied, we look for an unmapped area in the *full* address space, rather
than in the 47-bit window.

A high hint address only affects the allocation in question, not
any future mmap()s.

Specifying a high hint address on an older kernel or on a machine without
5-level paging support is safe. The hint will be ignored and the kernel will
fall back to allocation from the 47-bit address space.

This approach makes it easy to make an application's memory allocator aware
of the large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled if we already have a VMA above
the boundary, and forbid creating such VMAs once MPX is enabled.