.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protect. That means we just need to change the W bit of
   the spte.

What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and
SPTE_MMU_WRITEABLE bit on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.

On the fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
is safe because any change to these bits can be detected by cmpxchg.

But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg.
This is an ABA problem; for example, the following case
can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|     gpte = gfn1                                                        |
|     gfn1 is mapped to pfn1 on host                                     |
|     spte is the shadow page table entry corresponding with gpte and    |
|     spte = pfn1                                                        |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |    spte = 0;                      |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |    spte = pfn1;                   |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W))                             |
|       mark_page_dirty(vcpu->kvm, gfn1)                                 |
|            OOPS!!!                                                     |
+------------------------------------------------------------------------+

We dirty-log for gfn1, which means gfn2 is lost in the dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:

- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit cannot be lost.

But it is not true after fast page fault since the spte can be marked
writable between reading spte and updating spte.
Consider the following case:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|     spte.W = 0                                                         |
|     spte.Accessed = 1                                                  |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|       old_spte.W == 0)             |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | On fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | Memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   else                             |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|   if (old_spte.Accessed == 1)      |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|   if (old_spte.Dirty == 1)         |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); this
means the spte is always atomically updated in this case.

3) Flush TLBs due to spte updated

If the spte is updated from writable to readonly, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is a common function to update spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, and the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits.
In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.

3. Reference
------------

:Name:      kvm_lock
:Type:      mutex
:Arch:      any
:Protects:  - vm_list

:Name:      kvm_count_lock
:Type:      raw_spinlock_t
:Arch:      any
:Protects:  - hardware virtualization enable/disable
:Comment:   'raw' because hardware enabling/disabling must be atomic /wrt
            migration.

:Name:      kvm_arch::tsc_write_lock
:Type:      raw_spinlock
:Arch:      x86
:Protects:  - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
            - tsc offset in vmcb
:Comment:   'raw' because updating the tsc offsets must not be preempted.

:Name:      kvm->mmu_lock
:Type:      spinlock_t
:Arch:      any
:Protects:  - shadow page/shadow tlb entry
:Comment:   it is a spinlock since it is used in mmu notifier.

:Name:      kvm->srcu
:Type:      srcu lock
:Arch:      any
:Protects:  - kvm->memslots
            - kvm->buses
:Comment:   The srcu read lock must be held while accessing memslots (e.g.
            when using gfn_to_* functions) and while accessing in-kernel
            MMIO/PIO address->device structure mapping (kvm->buses).
            The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
            if it is needed by multiple functions.

:Name:      blocked_vcpu_on_cpu_lock
:Type:      spinlock_t
:Arch:      x86
:Protects:  blocked_vcpu_on_cpu
:Comment:   This is a per-CPU lock and it is used for VT-d posted-interrupts.
            When VT-d posted-interrupts are supported and the VM has assigned
            devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
            protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues
            a wakeup notification event because external interrupts from the
            assigned devices have arrived, we find the vCPU on the list and
            wake it up.
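
The list-plus-lock pattern described for blocked_vcpu_on_cpu_lock can be
sketched in user-space C. This is only an illustration under stated
assumptions, not KVM code: every ``demo_*`` name is hypothetical, a pthread
mutex stands in for the kernel spinlock, a single list stands in for the
per-CPU instances, and the wakeup handler unlinks the vCPU itself purely to
keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; not the kernel's struct kvm_vcpu. */
struct demo_vcpu {
	int id;
	bool blocked;           /* true while the vCPU waits for an interrupt */
	bool irq_pending;       /* set when an interrupt targets this vCPU */
	struct demo_vcpu *next; /* link in the blocked list */
};

/* Assumption: one such list and lock would exist per physical CPU. */
struct demo_blocked_list {
	pthread_mutex_t lock;   /* plays the role of blocked_vcpu_on_cpu_lock */
	struct demo_vcpu *head; /* plays the role of blocked_vcpu_on_cpu */
};

static void demo_list_init(struct demo_blocked_list *l)
{
	pthread_mutex_init(&l->lock, NULL);
	l->head = NULL;
}

/* A vCPU blocks: publish it on the list while holding the lock. */
static void demo_block_vcpu(struct demo_blocked_list *l, struct demo_vcpu *v)
{
	pthread_mutex_lock(&l->lock);
	assert(!v->blocked);
	v->blocked = true;
	v->next = l->head;
	l->head = v;
	pthread_mutex_unlock(&l->lock);
}

/* An interrupt for a blocked vCPU arrives: record it under the lock. */
static void demo_post_irq(struct demo_blocked_list *l, struct demo_vcpu *v)
{
	pthread_mutex_lock(&l->lock);
	v->irq_pending = true;
	pthread_mutex_unlock(&l->lock);
}

/*
 * Wakeup notification: walk the list under the lock, wake (and here also
 * unlink) every vCPU with a pending interrupt. Returns the number woken.
 */
static int demo_wakeup_handler(struct demo_blocked_list *l)
{
	int woken = 0;

	pthread_mutex_lock(&l->lock);
	struct demo_vcpu **pp = &l->head;
	while (*pp) {
		struct demo_vcpu *v = *pp;

		if (v->irq_pending) {
			*pp = v->next;  /* unlink from the list */
			v->next = NULL;
			v->blocked = false;
			v->irq_pending = false;
			woken++;
		} else {
			pp = &v->next;
		}
	}
	pthread_mutex_unlock(&l->lock);
	return woken;
}
```

Because both the insertion and the walk take the same lock, the handler
never observes a half-linked entry. A caller would block vCPUs with
``demo_block_vcpu()``, mark one with ``demo_post_irq()``, and then invoke
``demo_wakeup_handler()`` to wake only the marked vCPU.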