.. SPDX-License-Identifier: GPL-2.0

======================
The x86 kvm shadow mmu
======================

The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
for presenting a standard x86 mmu to the guest, while translating guest
physical addresses to host physical addresses.

The mmu code attempts to satisfy the following requirements:

- correctness:
        the guest should not be able to determine that it is running
        on an emulated mmu except for timing (we attempt to comply
        with the specification, not emulate the characteristics of
        a particular implementation such as tlb size)
- security:
        the guest must not be able to touch host memory not assigned
        to it
- performance:
        minimize the performance penalty imposed by the mmu
- scaling:
        need to scale to large memory and large vcpu guests
- hardware:
        support the full range of x86 virtualization hardware
- integration:
        Linux memory management code must be in control of guest memory
        so that swapping, page migration, page merging, transparent
        hugepages, and similar features work without change
- dirty tracking:
        report writes to guest memory to enable live migration
        and framebuffer-based
        displays
- footprint:
        keep the amount of pinned kernel memory low (most memory
        should be shrinkable)
- reliability:
        avoid multipage or GFP_ATOMIC allocations

Acronyms
========

====  ====================================================================
pfn   host page frame number
hpa   host physical address
hva   host virtual address
gfn   guest frame number
gpa   guest physical address
gva   guest virtual address
ngpa  nested guest physical address
ngva  nested guest virtual address
pte   page table entry (used also to refer generically to paging structure
      entries)
gpte  guest pte (referring to gfns)
spte  shadow pte (referring to pfns)
tdp   two dimensional paging (vendor neutral term for NPT and EPT)
====  ====================================================================

Virtual and real hardware supported
===================================

The mmu supports first-generation mmu hardware, which allows an atomic switch
of the current paging mode and cr3 during guest entry, as well as
two-dimensional paging (AMD's NPT and Intel's EPT).
The emulated hardware
it exposes is the traditional 2/3/4 level x86 mmu, with support for global
pages, pae, pse, pse36, cr0.wp, and 1GB pages.  The emulated hardware is also
able to expose NPT capable hardware on NPT capable hosts.

Translation
===========

The primary job of the mmu is to program the processor's mmu to translate
addresses for the guest.  Different translations are required at different
times:

- when guest paging is disabled, we translate guest physical addresses to
  host physical addresses (gpa->hpa)
- when guest paging is enabled, we translate guest virtual addresses, to
  guest physical addresses, to host physical addresses (gva->gpa->hpa)
- when the guest launches a guest of its own, we translate nested guest
  virtual addresses, to nested guest physical addresses, to guest physical
  addresses, to host physical addresses (ngva->ngpa->gpa->hpa)

The primary challenge is to encode between 1 and 3 translations into hardware
that supports only 1 (traditional) and 2 (tdp) translations.  When the
number of required translations matches the hardware, the mmu operates in
direct mode; otherwise it operates in shadow mode (see below).

Memory
======

Guest memory (gpa) is part of the user address space of the process that is
using kvm.
Userspace defines the translation between guest addresses and user
addresses (gpa->hva); note that two gpas may alias to the same hva, but not
vice versa.

These hvas may be backed using any method available to the host: anonymous
memory, file backed memory, and device memory.  Memory might be paged by the
host at any time.

Events
======

The mmu is driven by events, some from the guest, some from the host.

Guest generated events:

- writes to control registers (especially cr3)
- invlpg/invlpga instruction execution
- access to missing or protected translations

Host generated events:

- changes in the gpa->hpa translation (either through gpa->hva changes or
  through hva->hpa changes)
- memory pressure (the shrinker)

Shadow pages
============

The principal data structure is the shadow page, 'struct kvm_mmu_page'.  A
shadow page contains 512 sptes, which can be either leaf or nonleaf sptes.  A
shadow page may contain a mix of leaf and nonleaf sptes.

A nonleaf spte allows the hardware mmu to reach the leaf pages and
is not related to a translation directly.  It points to other shadow pages.
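
As an illustration of how a 512-entry shadow page is indexed, the sketch
below computes which spte covers a given address at a given level.  It
assumes 4k base pages and 9 index bits per level, as in x86 PAE and 64-bit
paging; the helper name ``spte_index`` and the macros are hypothetical and
are not taken from the kvm sources.

```c
#include <stdint.h>

#define PAGE_SHIFT      12   /* 4k base pages */
#define SPTES_PER_PAGE  512  /* 4096 bytes / 8 bytes per 64-bit spte */
#define LEVEL_BITS      9    /* log2(SPTES_PER_PAGE) */

/*
 * Hypothetical helper (illustration only): index of the spte that covers
 * 'addr' inside a shadow page at 'level', where level 1 is the lowest
 * level of the hierarchy.
 */
static unsigned int spte_index(uint64_t addr, int level)
{
	return (addr >> (PAGE_SHIFT + LEVEL_BITS * (level - 1))) &
	       (SPTES_PER_PAGE - 1);
}
```

For example, ``spte_index(0x200000, 2)`` selects entry 1 of a level-2 page,
because each level-2 entry spans 2MB.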

A leaf spte corresponds to either one or two translations encoded into
one paging structure entry.  These are always the lowest level of the
translation stack, with optional higher level translations left to NPT/EPT.
Leaf ptes point at guest pages.

The following table shows translations encoded by leaf ptes, with higher-level
translations in parentheses:

 Non-nested guests::

  nonpaging:     gpa->hpa
  paging:        gva->gpa->hpa
  paging, tdp:   (gva->)gpa->hpa

 Nested guests::

  non-tdp:       ngva->gpa->hpa  (*)
  tdp:           (ngva->)ngpa->gpa->hpa

  (*) the guest hypervisor will encode the ngva->gpa translation into its page
      tables if npt is not present

Shadow pages contain the following information:
  role.level:
    The level in the shadow paging hierarchy that this shadow page belongs to.
    1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
  role.direct:
    If set, leaf sptes reachable from this page are for a linear range.
    Examples include real mode translation, large guest pages backed by small
    host pages, and gpa->hpa translations when NPT or EPT is active.
    The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
    by role.level (2MB for first level, 1GB for second level, 0.5TB for third
    level, 256TB for fourth level)
    If clear, this page corresponds to a guest page table denoted by the gfn
    field.
  role.quadrant:
    When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host
    uses 64-bit sptes.  That means a guest page table contains more ptes than
    the host, so multiple shadow pages are needed to shadow one guest page.
    For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
    first or second 512-gpte block in the guest page table.  For second-level
    page tables, each 32-bit gpte is converted to two 64-bit sptes
    (since each first-level guest page is shadowed by two first-level
    shadow pages) so role.quadrant takes values in the range 0..3.  Each
    quadrant maps 1GB virtual address space.
  role.access:
    Inherited guest access permissions from the parent ptes in the form uwx.
    Note execute permission is positive, not negative.
  role.invalid:
    The page is invalid and should not be used.  It is a root page that is
    currently pinned (by a cpu hardware register pointing to it); once it is
    unpinned it will be destroyed.
  role.gpte_is_8_bytes:
    Reflects the size of the guest PTE for which the page is valid, i.e.
    '1' if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
  role.nxe:
    Contains the value of efer.nxe for which the page is valid.
  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.
  role.smep_andnot_wp:
    Contains the value of cr4.smep && !cr0.wp for which the page is valid
    (pages for which this is true are different from other pages; see the
    treatment of cr0.wp=0 below).
  role.smap_andnot_wp:
    Contains the value of cr4.smap && !cr0.wp for which the page is valid
    (pages for which this is true are different from other pages; see the
    treatment of cr0.wp=0 below).
  role.ept_sp:
    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
  role.smm:
    Is 1 if the page is valid in system management mode.  This field
    determines which of the kvm_memslots arrays was used to build this
    shadow page; it is also used to go back from a struct kvm_mmu_page
    to a memslot, through the kvm_memslots_for_spte_role macro and
    __gfn_to_memslot.
  role.ad_disabled:
    Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
    bits before Haswell; shadow EPT page tables also cannot use A/D bits
    if the L1 hypervisor does not enable them.
  gfn:
    Either the guest page table containing the translations shadowed by this
    page, or the base page frame for linear translations.  See role.direct.
  spt:
    A pageful of 64-bit sptes containing the translations for this page.
    Accessed by both kvm and hardware.
    The page pointed to by spt will have its page->private pointing back
    at the shadow page structure.
    sptes in spt point either at guest pages, or at lower-level shadow pages.
    Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
    at __pa(sp2->spt).  sp2 will point back at sp1 through parent_pte.
    The spt array forms a DAG structure with the shadow page as a node, and
    guest pages as leaves.
  gfns:
    An array of 512 guest frame numbers, one for each present pte.  Used to
    perform a reverse map from a pte to a gfn.  When role.direct is set, any
    element of this array can be calculated from the gfn field when needed;
    in this case, the array of gfns is not allocated.  See role.direct and gfn.
  root_count:
    A counter keeping track of how many hardware registers (guest cr3 or
    pdptrs) are now pointing at the page.  While this counter is nonzero, the
    page cannot be destroyed.  See role.invalid.
  parent_ptes:
    The reverse mapping for the pte/ptes pointing at this page's spt.
    If
    parent_ptes bit 0 is zero, only one spte points at this page and
    parent_ptes points at this single spte; otherwise, multiple sptes
    point at this page and (parent_ptes & ~0x1) points at a data
    structure with a list of parent sptes.
  unsync:
    If true, then the translations in this page may not match the guest's
    translation.  This is equivalent to the state of the tlb when a pte is
    changed but before the tlb entry is flushed.  Accordingly, unsync ptes
    are synchronized when the guest executes invlpg or flushes its tlb by
    other means.  Valid for leaf pages.
  unsync_children:
    How many sptes in the page point at pages that are unsync (or have
    unsynchronized children).
  unsync_child_bitmap:
    A bitmap indicating which sptes in spt point (directly or indirectly) at
    pages that may be unsynchronized.  Used to quickly locate all
    unsynchronized pages reachable from a given page.
  clear_spte_count:
    Only present on 32-bit hosts, where a 64-bit spte cannot be written
    atomically.  The reader uses this while running outside the MMU lock
    to detect in-progress updates and retry them until the writer has
    finished the write.
  write_flooding_count:
    A guest may write to a page table many times, causing a lot of
    emulations if the page needs to be write-protected (see "Synchronized
    and unsynchronized pages" below).  Leaf pages can be unsynchronized
    so that they do not trigger frequent emulation, but this is not
    possible for non-leafs.  This field counts the number of emulations
    since the last time the page table was actually used; if emulation
    is triggered too frequently on this page, KVM will unmap the page
    to avoid emulation in the future.

Reverse map
===========

The mmu maintains a reverse mapping whereby all ptes mapping a page can be
reached given its gfn.  This is used, for example, when swapping out a page.

Synchronized and unsynchronized pages
=====================================

The guest uses two events to synchronize its tlb and page tables: tlb flushes
and page invalidations (invlpg).

A tlb flush means that we need to synchronize all sptes reachable from the
guest's cr3.  This is expensive, so we keep all guest page tables write
protected, and synchronize sptes to gptes when a gpte is written.

A special case is when a guest page table is reachable from the current
guest cr3.
In this case, the guest is obliged to issue an invlpg instruction
before using the translation.  We take advantage of that by removing write
protection from the guest page, and allowing the guest to modify it freely.
We synchronize modified gptes when the guest invokes invlpg.  This reduces
the amount of emulation we have to do when the guest modifies multiple gptes,
or when a guest page is no longer used as a page table and is used for
random guest data.

As a side effect we have to resynchronize all reachable unsynchronized shadow
pages on a tlb flush.


Reaction to events
==================

- guest page fault (or npt page fault, or ept violation)

This is the most complicated event.
The cause of a page fault can be:

  - a true guest fault (the guest translation won't allow the access) (*)
  - access to a missing translation
  - access to a protected translation

    - when logging dirty pages, memory is write protected
    - synchronized shadow pages are write protected (*)

  - access to untranslatable memory (mmio)

  (*) not applicable in direct mode

Handling a page fault is performed as follows:

 - if the RSV bit of the error code is set, the page fault is caused by guest
   accessing MMIO and cached MMIO information is available.

   - walk shadow page table
   - check for valid generation number in the spte (see "Fast invalidation of
     MMIO sptes" below)
   - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
     vcpu->arch.mmio_gfn, and call the emulator

 - If both P bit and R/W bit of error code are set, this could possibly
   be handled as a "fast page fault" (fixed without taking the MMU lock).  See
   the description in Documentation/virt/kvm/locking.rst.

 - if needed, walk the guest page tables to determine the guest translation
   (gva->gpa or ngpa->gpa)

   - if permissions are insufficient, reflect the fault back to the guest

 - determine the host page

   - if this is an mmio request, there is no host page; cache the info to
     vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn

 - walk the shadow page table to find the spte for the translation,
   instantiating missing intermediate page tables as necessary

   - If this is an mmio request, cache the mmio info to the spte and set some
     reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)

 - try to unsynchronize the page

   - if successful, we can let the guest continue and modify the gpte

 - emulate the instruction

   - if failed, unshadow the page and let the guest continue

 - update any translations that were modified by the instruction

invlpg handling:

  - walk the shadow page hierarchy and drop affected translations
  - try to reinstantiate the indicated translation in the hope that the
    guest will use it in the near future

Guest control register updates:

- mov to cr3

  - look up new shadow roots
  - synchronize newly reachable shadow pages

- mov to cr0/cr4/efer

  - set up mmu context for new paging mode
  - look up new shadow roots
  - synchronize newly reachable shadow pages

Host translation updates:

  - mmu notifier called with updated hva
  - look up affected sptes through reverse map
  - drop (or update) translations

Emulating cr0.wp
================

If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
works for the guest kernel, not just guest userspace.  When the guest
cr0.wp=1, this does not present a problem.  However when the guest cr0.wp=0,
we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
semantics require allowing any guest kernel access plus user read access).
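
The problem can be checked by brute force.  The sketch below models only the
u and w bits of a present pte (nx, SMEP and SMAP are deliberately ignored)
and counts how many of the four possible sptes, evaluated with host cr0.wp=1,
grant exactly the access that the guest expects from a given gpte.  All names
are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Which kinds of access a translation permits. */
struct access_set {
	bool kernel_read, kernel_write, user_read, user_write;
};

/*
 * Simplified x86 semantics for a present pte with the given u/w bits.
 * Assumption for this sketch: nx, SMEP and SMAP are not modelled.
 */
static struct access_set pte_access(bool u, bool w, bool cr0_wp)
{
	struct access_set a;

	a.kernel_read  = true;
	a.kernel_write = w || !cr0_wp;	/* cr0.wp=0 lets the kernel ignore w=0 */
	a.user_read    = u;
	a.user_write   = u && w;
	return a;
}

static bool same_access(struct access_set a, struct access_set b)
{
	return a.kernel_read == b.kernel_read &&
	       a.kernel_write == b.kernel_write &&
	       a.user_read == b.user_read &&
	       a.user_write == b.user_write;
}

/*
 * Count the sptes (always evaluated with host cr0.wp=1) whose permitted
 * accesses equal those of the gpte under the guest's cr0.wp setting.
 */
static int count_matching_sptes(bool gpte_u, bool gpte_w, bool guest_cr0_wp)
{
	struct access_set want = pte_access(gpte_u, gpte_w, guest_cr0_wp);
	int matches = 0;

	for (int u = 0; u <= 1; u++)
		for (int w = 0; w <= 1; w++)
			if (same_access(want, pte_access(u, w, true)))
				matches++;
	return matches;
}
```

With guest cr0.wp=0, gpte.u=1 and gpte.w=0 the count is zero: no single spte
allows kernel writes and user reads while still denying user writes.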

We handle this by mapping the permissions to two possible sptes, depending
on fault type:

- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
  disallows user access)
- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
  write access)

(user write faults generate a #PF)

In the first case there are two additional complications:

- if CR4.SMEP is enabled: since we've turned the page into a kernel page,
  the kernel may now execute it.  We handle this by also setting spte.nx.
  If we get a user fetch or read fault, we'll change spte.u=1 and
  spte.nx=gpte.nx back.  For this to work, KVM forces EFER.NX to 1 when
  shadow paging is in use.
- if CR4.SMAP is disabled: since the page has been changed to a kernel
  page, it cannot be reused when CR4.SMAP is enabled.  We set
  CR4.SMAP && !CR0.WP into the shadow page's role to avoid this case.  Note,
  here we do not care about the case where CR4.SMAP is enabled, since KVM
  will directly inject #PF to the guest due to the failed permission check.

To prevent an spte that was converted into a kernel page with cr0.wp=0
from being written by the kernel after cr0.wp has changed to 1, we make
the value of cr0.wp part of the page role.
This means that an spte created
with one value of cr0.wp cannot be used when cr0.wp has a different value -
it will simply be missed by the shadow page lookup code.  A similar issue
exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after
changing cr4.smep to 1.  To avoid this, the value of !cr0.wp && cr4.smep
is also made a part of the page role.

Large pages
===========

The mmu supports all combinations of large and small guest and host pages.
Supported page sizes include 4k, 2M, 4M, and 1G.  4M pages are treated as
two separate 2M pages, on both guest and host, since the mmu always uses PAE
paging.

To instantiate a large spte, four constraints must be satisfied:

- the spte must point to a large host page
- the guest pte must be a large pte of at least equivalent size (if tdp is
  enabled, there is no guest pte and this condition is satisfied)
- if the spte will be writeable, the large page frame may not overlap any
  write-protected pages
- the guest page must be wholly contained by a single memory slot

To check the last two conditions, the mmu maintains a ->disallow_lpage set of
arrays for each memory slot and large page size.  Every write protected page
causes its disallow_lpage to be incremented, thus preventing instantiation of
a large spte.
The frames at the end of an unaligned memory slot have
artificially inflated ->disallow_lpages so they can never be instantiated.

Fast invalidation of MMIO sptes
===============================

As mentioned in "Reaction to events" above, kvm will cache MMIO
information in leaf sptes.  When a new memslot is added or an existing
memslot is changed, this information may become stale and needs to be
invalidated.  This also needs to hold the MMU lock while walking all
shadow pages, and is made more scalable with a similar technique.

MMIO sptes have a few spare bits, which are used to store a
generation number.  The global generation number is stored in
kvm_memslots(kvm)->generation, and increased whenever guest memory info
changes.

When KVM finds an MMIO spte, it checks the generation number of the spte.
If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.

Since only 18 bits are used to store generation-number on mmio spte, all
pages are zapped when there is an overflow.

Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
times, the last one happening when the generation number is retrieved and
stored into the MMIO spte.
Thus, the MMIO spte might be created based on
out-of-date information, but with an up-to-date generation number.

To avoid this, the generation number is incremented again after synchronize_srcu
returns; thus, bit 63 of kvm_memslots(kvm)->generation is set to 1 only during a
memslot update, while some SRCU readers might be using the old copy.  We do not
want to use an MMIO spte created with an odd generation number, and we can do
this without losing a bit in the MMIO spte.  The "update in-progress" bit of the
generation is not stored in the MMIO spte, and so is implicitly zero when the
generation is extracted out of the spte.  If KVM is unlucky and creates an MMIO
spte while an update is in-progress, the next access to the spte will always be
a cache miss.  For example, a subsequent access during the update window will
miss due to the in-progress flag diverging, while an access after the update
window closes will have a higher generation number (as compared to the spte).


Further reading
===============

- NPT presentation from KVM Forum 2008
  https://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf