.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.
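
The ordering rules above can be modelled as a small checker. The following is
a userspace sketch only, not kernel code: the ``demo_*`` names are
hypothetical, and the table encodes just the orderings stated in this section,
with the transitive cases (for example kvm->lock outside hv_lock) filled in
by hand.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical identifiers for the locks named in this section. */
    enum demo_lock {
        DEMO_KVM_LOCK,    /* kvm->lock */
        DEMO_VCPU_MUTEX,  /* vcpu->mutex */
        DEMO_SLOTS_LOCK,  /* kvm->slots_lock */
        DEMO_IRQ_LOCK,    /* kvm->irq_lock */
        DEMO_HV_LOCK,     /* kvm->arch.hyperv.hv_lock (x86) */
        DEMO_NR_LOCKS
    };

    #define DEMO_BIT(l) (1u << (l))

    /* demo_inner[l]: locks that must only be taken inside l. */
    static const unsigned int demo_inner[DEMO_NR_LOCKS] = {
        [DEMO_KVM_LOCK]   = DEMO_BIT(DEMO_VCPU_MUTEX) |
                            DEMO_BIT(DEMO_SLOTS_LOCK) |
                            DEMO_BIT(DEMO_IRQ_LOCK) |
                            DEMO_BIT(DEMO_HV_LOCK),
        [DEMO_VCPU_MUTEX] = DEMO_BIT(DEMO_HV_LOCK),
        [DEMO_SLOTS_LOCK] = DEMO_BIT(DEMO_IRQ_LOCK),
    };

    /*
     * Record that 'lock' is acquired while the locks in '*held' are held.
     * Returns false, leaving '*held' unchanged, if a lock that is already
     * held should have been taken inside 'lock'.
     */
    static bool demo_acquire(unsigned int *held, enum demo_lock lock)
    {
        if (demo_inner[lock] & *held)
            return false;
        *held |= DEMO_BIT(lock);
        return true;
    }

Taking kvm->lock, then vcpu->mutex, then hv_lock passes this check, while
taking kvm->slots_lock with kvm->irq_lock already held is rejected.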

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault without
taking the mmu-lock on x86. Currently, the page fault can be fast in one of
the following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protect. That means we just need to change the W bit of
   the spte.

What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and
SPTE_MMU_WRITEABLE bit on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.

On fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
is safe because any change to these bits is detected by cmpxchg.
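
A minimal userspace sketch of the write-protection case, using C11 atomics in
place of the kernel's cmpxchg. The ``demo_*`` names and the bit positions are
hypothetical, and the sketch assumes that the fast path may only set W when
both software-writable bits listed above are set.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical bit positions, chosen only for this sketch. */
    #define DEMO_SPTE_W              (1ull << 1)
    #define DEMO_SPTE_HOST_WRITEABLE (1ull << 57)
    #define DEMO_SPTE_MMU_WRITEABLE  (1ull << 58)

    /*
     * Try to make *sptep writable without taking mmu-lock.
     * Returns true only if this call set the W bit.
     */
    static bool demo_fast_fix_write_fault(_Atomic uint64_t *sptep)
    {
        uint64_t need = DEMO_SPTE_HOST_WRITEABLE | DEMO_SPTE_MMU_WRITEABLE;
        uint64_t old_spte = atomic_load(sptep);

        if ((old_spte & need) != need)
            return false;   /* not eligible: fall back to the slow path */
        if (old_spte & DEMO_SPTE_W)
            return false;   /* already writable, nothing to fix */

        /* Fails, changing nothing, if *sptep no longer equals old_spte. */
        return atomic_compare_exchange_strong(sptep, &old_spte,
                                              old_spte | DEMO_SPTE_W);
    }

If another CPU zaps or rewrites the spte between the load and the
compare-and-exchange, the exchange fails and the fault has to be retried or
handled under mmu-lock.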

However, we need to check the following cases carefully:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg. This is an ABA problem; for example, the
following can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|	gpte = gfn1                                                      |
|	gfn1 is mapped to pfn1 on host                                   |
|	spte is the shadow page table entry corresponding with gpte and  |
|	spte = pfn1                                                      |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |    spte = 0;                      |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |    spte = pfn1;                   |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W))                             |
|	mark_page_dirty(vcpu->kvm, gfn1)                                 |
|            OOPS!!!                                                     |
+------------------------------------------------------------------------+

We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.
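
The interleaving in the table can be replayed in a single thread to show why
cmpxchg alone does not detect it. This is an illustrative sketch with
hypothetical ``demo_*`` names and values: the spte holds the same bits before
and after CPU 1's steps, so the compare succeeds even though the gfn behind it
has changed.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define DEMO_SPTE_W (1ull << 1)   /* hypothetical position of the W bit */
    #define DEMO_PFN1   0x1000ull     /* hypothetical spte value for pfn1 */

    /*
     * Replay the table's interleaving. Returns true if CPU 0's cmpxchg
     * succeeds; *logged_gfn is the gfn CPU 0 would dirty-log and
     * *actual_gfn is the gfn the spte maps at that point.
     */
    static bool demo_aba_replay(uint64_t *logged_gfn, uint64_t *actual_gfn)
    {
        _Atomic uint64_t spte = DEMO_PFN1;
        uint64_t gfn = 1;                       /* gpte = gfn1 */

        /* CPU 0: read the spte on the fast page fault path. */
        uint64_t old_spte = atomic_load(&spte);
        uint64_t gfn_seen = gfn;

        /* CPU 1: pfn1 is swapped out, reused for gfn2, and mapped again. */
        atomic_store(&spte, 0);
        gfn = 2;
        atomic_store(&spte, DEMO_PFN1);

        /* CPU 0: the spte bits match old_spte, so the exchange succeeds. */
        bool ok = atomic_compare_exchange_strong(&spte, &old_spte,
                                                 old_spte | DEMO_SPTE_W);
        *logged_gfn = gfn_seen;
        *actual_gfn = gfn;
        return ok;
    }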

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn.  For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg.  After the pinning:

- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set since the
Accessed bit and Dirty bit can not be lost.

But this is no longer true with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the case below:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|	spte.W = 0                                                       |
|	spte.Accessed = 1                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|       old_spte.W == 0)             |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
|  ::                                |                                   |
|                                    |                                   |
|   else                             |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|   if (old_spte.Accessed == 1)      |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|   if (old_spte.Dirty == 1)         |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); this
means the spte is always updated atomically in this case.
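
The difference between the two update styles can be replayed in one thread.
This is a sketch with hypothetical ``demo_*`` names and bit positions, using
C11 ``atomic_exchange`` as a stand-in for the kernel's xchg: the plain
overwrite relies on a stale read and never sees the Dirty bit, while the
atomic exchange returns the value the spte held at the moment it was cleared.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical bit positions, chosen only for this sketch. */
    #define DEMO_SPTE_W        (1ull << 1)
    #define DEMO_SPTE_ACCESSED (1ull << 8)
    #define DEMO_SPTE_DIRTY    (1ull << 9)

    /*
     * Replay the table's interleaving and return the old spte value that
     * the zapping CPU ends up inspecting for Accessed/Dirty.
     */
    static uint64_t demo_zap_replay(bool atomic_zap)
    {
        _Atomic uint64_t spte = DEMO_SPTE_ACCESSED;   /* W = 0, Accessed = 1 */

        /* CPU 0: read the spte before deciding how to clear it. */
        uint64_t old_spte = atomic_load(&spte);

        /* CPU 1: fast page fault sets W, then a guest write sets Dirty. */
        atomic_fetch_or(&spte, DEMO_SPTE_W);
        atomic_fetch_or(&spte, DEMO_SPTE_DIRTY);

        if (atomic_zap)
            old_spte = atomic_exchange(&spte, 0);     /* sees Dirty */
        else
            atomic_store(&spte, 0);                   /* stale old_spte kept */

        return old_spte;
    }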

3) Flush TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. To make the path easy to audit, we check whether
TLBs need to be flushed for this reason in mmu_spte_update(), since this is
the common function for updating an spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
update the spte atomically, so the race caused by fast page fault is avoided.
See the comments in spte_has_volatile_bits() and mmu_spte_update().
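
As a rough illustration of why the flush decision is tied to the atomic
update, the sketch below derives it from the value returned by the exchange
rather than from an earlier read. The ``demo_*`` names and the bit position
are hypothetical, and the condition is simplified; the real
mmu_spte_update() may check more than the W bit.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define DEMO_SPTE_W (1ull << 1)   /* hypothetical position of the W bit */

    /*
     * Atomically install new_spte. Returns true if the spte went from
     * writable to read-only, i.e. a writable translation may still be
     * cached in some TLB and a flush is needed.
     */
    static bool demo_spte_update(_Atomic uint64_t *sptep, uint64_t new_spte)
    {
        uint64_t old_spte = atomic_exchange(sptep, new_spte);

        return (old_spte & DEMO_SPTE_W) && !(new_spte & DEMO_SPTE_W);
    }

Because the old value comes from the exchange itself, a W bit set by a
concurrent fast page fault just before the update is still observed.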

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
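
The save and restore steps can be sketched with plain bit operations. The
``demo_*`` names and the shift used for the saved bits are hypothetical; only
the use of bit 62 for the special mask comes from the text above, and the R,
W and X positions follow the usual EPT layout (bits 0, 1 and 2). In the
kernel, the restore happens atomically on the fast page fault path.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define DEMO_SPTE_R       (1ull << 0)
    #define DEMO_SPTE_W       (1ull << 1)
    #define DEMO_SPTE_X       (1ull << 2)
    #define DEMO_SPTE_RWX     (DEMO_SPTE_R | DEMO_SPTE_W | DEMO_SPTE_X)
    #define DEMO_SPTE_SPECIAL (1ull << 62)
    #define DEMO_SAVED_SHIFT  52      /* hypothetical home of the saved bits */
    #define DEMO_SAVED_MASK   ((DEMO_SPTE_R | DEMO_SPTE_X) << DEMO_SAVED_SHIFT)

    /* Make the spte not-present, remembering R and X (but not W). */
    static uint64_t demo_mark_for_access_track(uint64_t spte)
    {
        uint64_t saved = spte & (DEMO_SPTE_R | DEMO_SPTE_X);

        spte &= ~DEMO_SPTE_RWX;
        return spte | (saved << DEMO_SAVED_SHIFT) | DEMO_SPTE_SPECIAL;
    }

    /* Restore R and X; set W only if the faulting access was a write. */
    static uint64_t demo_restore_access_track(uint64_t spte, bool write_access)
    {
        uint64_t saved = (spte & DEMO_SAVED_MASK) >> DEMO_SAVED_SHIFT;

        spte &= ~(DEMO_SPTE_SPECIAL | DEMO_SAVED_MASK);
        spte |= saved;
        if (write_access)
            spte |= DEMO_SPTE_W;
        return spte;
    }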

3. Reference
------------

:Name:		kvm_lock
:Type:		mutex
:Arch:		any
:Protects:	- vm_list

:Name:		kvm_count_lock
:Type:		raw_spinlock_t
:Arch:		any
:Protects:	- hardware virtualization enable/disable
:Comment:	'raw' because hardware enabling/disabling must be atomic w.r.t.
		migration.

:Name:		kvm_arch::tsc_write_lock
:Type:		raw_spinlock
:Arch:		x86
:Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
		- tsc offset in vmcb
:Comment:	'raw' because updating the tsc offsets must not be preempted.

:Name:		kvm->mmu_lock
:Type:		spinlock_t
:Arch:		any
:Protects:	- shadow page/shadow tlb entry
:Comment:	it is a spinlock since it is used in mmu notifier.

:Name:		kvm->srcu
:Type:		srcu lock
:Arch:		any
:Protects:	- kvm->memslots
		- kvm->buses
:Comment:	The srcu read lock must be held while accessing memslots (e.g.
		when using gfn_to_* functions) and while accessing in-kernel
		MMIO/PIO address->device structure mapping (kvm->buses).
		The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
		if it is needed by multiple functions.

:Name:		blocked_vcpu_on_cpu_lock
:Type:		spinlock_t
:Arch:		x86
:Protects:	- blocked_vcpu_on_cpu
:Comment:	This is a per-CPU lock and it is used for VT-d posted-interrupts.
		When VT-d posted-interrupts are supported and the VM has assigned
		devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
		protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues
		a wakeup notification event because external interrupts from the
		assigned devices have arrived, we find the vCPU on the list to
		wake up.