.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels: the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.
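
As a rough illustration, the current setting can be read back from sysfs.
The sketch below assumes the parameter is exposed as
/sys/module/kvm_intel/parameters/nested and reads as "Y"/"N" (or "1"/"0");
the helper names are hypothetical and not part of KVM::

	#include <stdio.h>

	/* Hypothetical helper: interpret the text of a boolean module
	 * parameter. Returns 1 if enabled, 0 if disabled, -1 if the text
	 * is not recognized. */
	int parse_bool_param(const char *text)
	{
		if (!text)
			return -1;
		switch (text[0]) {
		case 'Y': case 'y': case '1':
			return 1;
		case 'N': case 'n': case '0':
			return 0;
		default:
			return -1;
		}
	}

	/* Hypothetical helper: read a boolean parameter file. Returns -1 if
	 * the file cannot be opened (for example, module not loaded). */
	int read_bool_param(const char *path)
	{
		char buf[8] = "";
		FILE *f = fopen(path, "r");

		if (!f)
			return -1;
		if (!fgets(buf, sizeof(buf), f))
			buf[0] = '\0';
		fclose(f);
		return parse_bool_param(buf);
	}

Under these assumptions, calling read_bool_param() on the path above returns
1 when nested VMX is enabled.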

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     - ``-cpu host`` (emulated CPU has all features of the real CPU)

     - ``-cpu qemu64,+vmx`` (add just the vmx feature to a named CPU type)

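Inside L1, one quick way to confirm that the feature was actually exposed is
to look for the "vmx" flag on the flags line of /proc/cpuinfo. The following
is a minimal sketch of the token check only, assuming a Linux guest; the
helper name is hypothetical::

	#include <string.h>

	/* Hypothetical helper: return 1 if the whitespace-separated flag
	 * list contains exactly the token `flag`, else 0. A substring hit
	 * such as "vmx" inside "vmx_extra" does not count. */
	int cpu_flags_contain(const char *flags, const char *flag)
	{
		size_t n = strlen(flag);
		const char *p = flags;

		if (n == 0)
			return 0;
		while ((p = strstr(p, flag)) != NULL) {
			int starts = (p == flags) || p[-1] == ' ' ||
				     p[-1] == '\t';
			int ends = p[n] == '\0' || p[n] == ' ' ||
				   p[n] == '\n';

			if (starts && ends)
				return 1;
			p++;
		}
		return 0;
	}

A caller would pass each line of /proc/cpuinfo that begins with "flags" and
treat a return value of 1 as "VMX is visible to this guest".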

ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
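
To illustrate what "accessed through VMREAD and VMWRITE" means for an
emulated VMCS, the sketch below maps a few architectural field encodings to
offsets in a toy structure. It is not KVM's implementation: struct toy_vmcs
and all toy_* names are hypothetical, only four fields are modelled, and
natural-width fields are assumed to be 64 bits wide::

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* Toy stand-in for a guest-visible VMCS; NOT the real vmcs12. */
	struct toy_vmcs {
		uint64_t guest_rsp;
		uint64_t guest_rip;
		uint32_t vm_exit_reason;
		uint16_t guest_cs_selector;
	};

	/* Maps a field encoding to an offset in struct toy_vmcs. */
	struct toy_field {
		uint32_t encoding;
		size_t offset;
	};

	static const struct toy_field toy_fields[] = {
		{ 0x0802, offsetof(struct toy_vmcs, guest_cs_selector) },
		{ 0x4402, offsetof(struct toy_vmcs, vm_exit_reason) },
		{ 0x681c, offsetof(struct toy_vmcs, guest_rsp) },
		{ 0x681e, offsetof(struct toy_vmcs, guest_rip) },
	};

	/* Field width in bytes, taken from bits 14:13 of the encoding. */
	static size_t toy_width(uint32_t encoding)
	{
		switch ((encoding >> 13) & 3) {
		case 0:
			return 2;	/* 16-bit field */
		case 2:
			return 4;	/* 32-bit field */
		default:
			return 8;	/* 64-bit or natural-width field */
		}
	}

	static const struct toy_field *toy_lookup(uint32_t encoding)
	{
		size_t i;

		for (i = 0; i < sizeof(toy_fields) / sizeof(toy_fields[0]); i++)
			if (toy_fields[i].encoding == encoding)
				return &toy_fields[i];
		return NULL;
	}

	/* Emulated VMREAD: 0 on success, -1 for an unsupported field. */
	int toy_vmread(const struct toy_vmcs *v, uint32_t encoding,
		       uint64_t *val)
	{
		const struct toy_field *f = toy_lookup(encoding);
		const char *p;
		uint16_t v16;
		uint32_t v32;

		if (!f)
			return -1;
		p = (const char *)v + f->offset;
		switch (toy_width(encoding)) {
		case 2:
			memcpy(&v16, p, sizeof(v16));
			*val = v16;
			break;
		case 4:
			memcpy(&v32, p, sizeof(v32));
			*val = v32;
			break;
		default:
			memcpy(val, p, sizeof(*val));
			break;
		}
		return 0;
	}

	/* Emulated VMWRITE: 0 on success, -1 for an unsupported field. */
	int toy_vmwrite(struct toy_vmcs *v, uint32_t encoding, uint64_t val)
	{
		const struct toy_field *f = toy_lookup(encoding);
		uint16_t v16 = (uint16_t)val;
		uint32_t v32 = (uint32_t)val;
		char *p;

		if (!f)
			return -1;
		p = (char *)v + f->offset;
		switch (toy_width(encoding)) {
		case 2:
			memcpy(p, &v16, sizeof(v16));
			break;
		case 4:
			memcpy(p, &v32, sizeof(v32));
			break;
		default:
			memcpy(p, &val, sizeof(val));
			break;
		}
		return 0;
	}

The point of the indirection is that the guest only ever names fields by
their encoding, so the in-memory layout stays private to the implementation.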

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.
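
To give a flavour of the relationship between the three structures, consider
just one control, the exception bitmap. The following is a deliberately
simplified toy model, not KVM's code: all names are hypothetical, and the real
handling (for example of page faults, which also involve the error-code mask
and match fields) is more involved::

	#include <stdint.h>

	/* Toy model: only the exception bitmap of a VMCS. */
	struct toy_ctl {
		uint32_t exception_bitmap;
	};

	/* vmcs02 must intercept whatever either L0 or L1 wants to see. */
	struct toy_ctl toy_merge(struct toy_ctl vmcs01, struct toy_ctl vmcs12)
	{
		struct toy_ctl vmcs02;

		vmcs02.exception_bitmap = vmcs01.exception_bitmap |
					  vmcs12.exception_bitmap;
		return vmcs02;
	}

	/* After an exception exit from L2: return 1 if L1 asked for this
	 * vector (forward the exit to L1), 0 if L0 handles it itself. */
	int toy_exit_goes_to_l1(struct toy_ctl vmcs12, unsigned int vector)
	{
		return vector < 32 &&
		       ((vmcs12.exception_bitmap >> vector) & 1u);
	}

In this model, an exception that only L0 asked to intercept is handled by L0
and L2 is resumed without L1 ever noticing.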

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};

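Because an accidental layout change is easy to miss in a structure of this
size, one way to guard against it is a set of compile-time offset checks.
The following is a minimal sketch on a toy prefix of the layout above;
struct toy_vmcs12 and TOY_CHECK_OFFSET are hypothetical names, and the packed
attribute is the GCC/Clang spelling::

	#include <assert.h>	/* static_assert (C11) */
	#include <stddef.h>
	#include <stdint.h>

	/* Toy prefix of the layout above; NOT the real struct vmcs12. */
	struct toy_vmcs12 {
		uint32_t revision_id;
		uint32_t abort;
		uint32_t launch_state;
		uint32_t padding[7];
		uint64_t io_bitmap_a;
		uint64_t io_bitmap_b;
	} __attribute__((packed));

	/* Fail the build if a field moves from its expected byte offset. */
	#define TOY_CHECK_OFFSET(field, loc)				\
		static_assert(offsetof(struct toy_vmcs12, field) == (loc), \
			      "offset of " #field " changed")

	TOY_CHECK_OFFSET(revision_id, 0);
	TOY_CHECK_OFFSET(abort, 4);
	TOY_CHECK_OFFSET(launch_state, 8);
	TOY_CHECK_OFFSET(io_bitmap_a, 40);
	TOY_CHECK_OFFSET(io_bitmap_b, 48);

With checks of this kind, inserting a field in the middle of the structure
breaks the build, which is a reminder to bump the revision instead of silently
changing the layout.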

Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.