==============================
Running nested guests with KVM
==============================

Nesting is the ability to run a guest inside another guest (the inner
guest can be KVM-based or use a different hypervisor).  The
straightforward example is a KVM guest that in turn runs on a KVM guest
(the rest of this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (Logical PARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc.) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.19, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
newer, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
virtualization.
Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest.  Alternatively, for better live migration compatibility, use a
named CPU model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest
with accelerated KVM.


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter, i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature; with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.
You might see a ``kernel BUG!`` entry in ``dmesg``, a kernel 'oops', or an
outright kernel panic.  Such a migrated or loaded L1 guest can no longer be
considered stable or secure, and must be restarted.  Migrating an L1 guest
merely configured to support nesting, while not actually running L2 guests,
is expected to function normally even on AMD systems but may fail once
guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while
  they think they're running nested KVM.  This confuses "nested Virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both of the commands below, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions under the same name:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0
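
As a rough sketch (not part of any KVM tooling), most of the items listed
in the sections above could be gathered with a small shell script that is
run once on L0 and once on L1.  The ``collect`` helper, the ``OUTDIR``
variable, the output file names, and the ``qemu-system-x86_64`` binary name
are assumptions made for illustration; adjust them to your distribution.
The QEMU command-lines of L1 and L2 still have to be copied by hand.

```shell
#!/bin/sh
# Sketch: gather bug-report details into one directory per machine.
# OUTDIR and all file names below are illustrative assumptions.
OUTDIR="${OUTDIR:-./nested-report-$(uname -n)}"
mkdir -p "$OUTDIR"

# Run a command and store its output in $OUTDIR/<name>.txt.  If the tool
# is missing or lacks privileges, record that instead of aborting.
collect() {
    name="$1"
    shift
    if ! "$@" > "$OUTDIR/$name.txt" 2>&1; then
        echo "command failed or unavailable: $*" >> "$OUTDIR/$name.txt"
    fi
}

# Generic details
collect kernel-version uname -a
collect qemu-version qemu-system-x86_64 --version  # binary name varies
collect libvirt-version virsh --version
collect cpuinfo cat /proc/cpuinfo
collect lscpu lscpu
collect dmesg dmesg

# x86-specific details (dmidecode usually needs root)
collect x86info x86info -a
collect dmidecode dmidecode

# s390x-specific details (only meaningful on s390x)
collect sysinfo cat /proc/sysinfo

# State of the "nested" parameter for whichever KVM modules are loaded
for p in /sys/module/kvm*/parameters/nested; do
    [ -r "$p" ] && echo "$p: $(cat "$p")"
done > "$OUTDIR/kvm-nested.txt"

echo "Report written to $OUTDIR"
```

Each output file either holds the command's output or a one-line note that
the command failed, so a missing tool on one level does not stop the rest
of the collection.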