.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

        https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     - ``-cpu host`` (emulated CPU has all features of the real CPU)

     - ``-cpu qemu64,+vmx`` (add just the vmx feature to a named CPU type)


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).
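
Before relying on this ABI, a tool may want to confirm that the "nested"
option described under "Running nested VMX" is actually switched on. The
following is a minimal sketch, assuming the option is exposed as a sysfs
module parameter file whose text starts with Y/N (or 1/0). The function names
``parse_nested_param`` and ``read_nested_param`` are hypothetical and not part
of KVM.

```c
#include <stdio.h>

/*
 * Interpret the text of a boolean module parameter.
 * Returns 1 if enabled, 0 if disabled, -1 if the text is not recognised.
 */
int parse_nested_param(const char *text)
{
	if (text == NULL)
		return -1;

	switch (text[0]) {
	case 'Y':
	case 'y':
	case '1':
		return 1;
	case 'N':
	case 'n':
	case '0':
		return 0;
	default:
		return -1;
	}
}

/*
 * Read a parameter file and interpret its first line.
 * Returns -1 if the file cannot be opened or its content is not recognised.
 */
int read_nested_param(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL)
		buf[0] = '\0';
	fclose(f);

	return parse_nested_param(buf);
}
```

A caller would typically pass ``/sys/module/kvm_intel/parameters/nested`` to
``read_nested_param()`` and treat a result of -1 as "module not loaded or
state unknown" rather than as "disabled".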

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2 - how this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.
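
Because the layout of struct vmcs12 matters for live migration, one way to
notice an accidental layout change is to pin field offsets at compile time.
The following is a minimal sketch and not the kernel's actual code:
``demo_vmcs12`` is a hypothetical cut-down structure that mirrors only the
leading fields of struct vmcs12, and ``DEMO_CHECK_OFFSET`` is an illustrative
macro name.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* Hypothetical stand-in for struct vmcs12: leading fields only. */
struct __attribute__((packed)) demo_vmcs12 {
	u32 revision_id;	/* user-visible, per the Intel spec */
	u32 abort;		/* user-visible, per the Intel spec */
	u32 launch_state;
	u32 padding[7];		/* room for future expansion */
	u64 io_bitmap_a;
	u64 io_bitmap_b;
};

/* Illustrative helper: the build fails if a field moves. */
#define DEMO_CHECK_OFFSET(field, off) \
	static_assert(offsetof(struct demo_vmcs12, field) == (off), \
		      "offset of " #field " changed; revision must change too")

DEMO_CHECK_OFFSET(revision_id, 0);
DEMO_CHECK_OFFSET(abort, 4);
DEMO_CHECK_OFFSET(launch_state, 8);
DEMO_CHECK_OFFSET(padding, 12);
DEMO_CHECK_OFFSET(io_bitmap_a, 40);
DEMO_CHECK_OFFSET(io_bitmap_b, 48);
```

With checks of this kind, inserting or resizing a field ahead of
``io_bitmap_a`` stops the build, which is a natural point to decide whether
the revision identifier also has to change.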

::

        typedef u64 natural_width;
        struct __packed vmcs12 {
                /* According to the Intel spec, a VMCS region must start with
                 * these two user-visible fields */
                u32 revision_id;
                u32 abort;

                u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
                u32 padding[7]; /* room for future expansion */

                u64 io_bitmap_a;
                u64 io_bitmap_b;
                u64 msr_bitmap;
                u64 vm_exit_msr_store_addr;
                u64 vm_exit_msr_load_addr;
                u64 vm_entry_msr_load_addr;
                u64 tsc_offset;
                u64 virtual_apic_page_addr;
                u64 apic_access_addr;
                u64 ept_pointer;
                u64 guest_physical_address;
                u64 vmcs_link_pointer;
                u64 guest_ia32_debugctl;
                u64 guest_ia32_pat;
                u64 guest_ia32_efer;
                u64 guest_pdptr0;
                u64 guest_pdptr1;
                u64 guest_pdptr2;
                u64 guest_pdptr3;
                u64 host_ia32_pat;
                u64 host_ia32_efer;
                u64 padding64[8]; /* room for future expansion */
                natural_width cr0_guest_host_mask;
                natural_width cr4_guest_host_mask;
                natural_width cr0_read_shadow;
                natural_width cr4_read_shadow;
                natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
                natural_width exit_qualification;
                natural_width guest_linear_address;
                natural_width guest_cr0;
                natural_width guest_cr3;
                natural_width guest_cr4;
                natural_width guest_es_base;
                natural_width guest_cs_base;
                natural_width guest_ss_base;
                natural_width guest_ds_base;
                natural_width guest_fs_base;
                natural_width guest_gs_base;
                natural_width guest_ldtr_base;
                natural_width guest_tr_base;
                natural_width guest_gdtr_base;
                natural_width guest_idtr_base;
                natural_width guest_dr7;
                natural_width guest_rsp;
                natural_width guest_rip;
                natural_width guest_rflags;
                natural_width guest_pending_dbg_exceptions;
                natural_width guest_sysenter_esp;
                natural_width guest_sysenter_eip;
                natural_width host_cr0;
                natural_width host_cr3;
                natural_width host_cr4;
                natural_width host_fs_base;
                natural_width host_gs_base;
                natural_width host_tr_base;
                natural_width host_gdtr_base;
                natural_width host_idtr_base;
                natural_width host_ia32_sysenter_esp;
                natural_width host_ia32_sysenter_eip;
                natural_width host_rsp;
                natural_width host_rip;
                natural_width paddingl[8]; /* room for future expansion */
                u32 pin_based_vm_exec_control;
                u32 cpu_based_vm_exec_control;
                u32 exception_bitmap;
                u32 page_fault_error_code_mask;
                u32 page_fault_error_code_match;
                u32 cr3_target_count;
                u32 vm_exit_controls;
                u32 vm_exit_msr_store_count;
                u32 vm_exit_msr_load_count;
                u32 vm_entry_controls;
                u32 vm_entry_msr_load_count;
                u32 vm_entry_intr_info_field;
                u32 vm_entry_exception_error_code;
                u32 vm_entry_instruction_len;
                u32 tpr_threshold;
                u32 secondary_vm_exec_control;
                u32 vm_instruction_error;
                u32 vm_exit_reason;
                u32 vm_exit_intr_info;
                u32 vm_exit_intr_error_code;
                u32 idt_vectoring_info_field;
                u32 idt_vectoring_error_code;
                u32 vm_exit_instruction_len;
                u32 vmx_instruction_info;
                u32 guest_es_limit;
                u32 guest_cs_limit;
                u32 guest_ss_limit;
                u32 guest_ds_limit;
                u32 guest_fs_limit;
                u32 guest_gs_limit;
                u32 guest_ldtr_limit;
                u32 guest_tr_limit;
                u32 guest_gdtr_limit;
                u32 guest_idtr_limit;
                u32 guest_es_ar_bytes;
                u32 guest_cs_ar_bytes;
                u32 guest_ss_ar_bytes;
                u32 guest_ds_ar_bytes;
                u32 guest_fs_ar_bytes;
                u32 guest_gs_ar_bytes;
                u32 guest_ldtr_ar_bytes;
                u32 guest_tr_ar_bytes;
                u32 guest_interruptibility_info;
                u32 guest_activity_state;
                u32 guest_sysenter_cs;
                u32 host_ia32_sysenter_cs;
                u32 padding32[8]; /* room for future expansion */
                u16 virtual_processor_id;
                u16 guest_es_selector;
                u16 guest_cs_selector;
                u16 guest_ss_selector;
                u16 guest_ds_selector;
                u16 guest_fs_selector;
                u16 guest_gs_selector;
                u16 guest_ldtr_selector;
                u16 guest_tr_selector;
                u16 host_es_selector;
                u16 host_cs_selector;
                u16 host_ss_selector;
                u16 host_ds_selector;
                u16 host_fs_selector;
                u16 host_gs_selector;
                u16 host_tr_selector;
        };


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.