==============================
Running nested guests with KVM
==============================

A nested guest is a guest that runs inside another guest (which can be
KVM-based or use a different hypervisor).  The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1; this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (Logical PARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``) often run their own
  VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

If you are running a Linux kernel older than v4.19, enable nesting by
setting the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, add it to a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.
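
The verification in step 5 can be wrapped in a small helper that works
for either vendor module.  The following is only a sketch: the function
name ``kvm_nested_enabled`` and its optional second argument (an
alternative sysfs root, mainly useful for testing) are illustrative and
not part of any KVM tooling.  Note that sysfs spells module names with
underscores (``kvm_intel``, ``kvm_amd``), and that the value may read as
``Y`` or ``1`` depending on the module and kernel version::

    # Hypothetical helper: succeed if the "nested" parameter of the given
    # KVM vendor module (kvm_intel or kvm_amd) reads as enabled.
    kvm_nested_enabled() {
        module=$1
        sysfs_root=${2:-/sys}
        param="$sysfs_root/module/$module/parameters/nested"
        # Exit status 2: module not loaded, or it has no such parameter.
        [ -r "$param" ] || return 2
        case "$(cat "$param")" in
            Y|1) return 0 ;;
            *)   return 1 ;;
        esac
    }

For example, ``kvm_nested_enabled kvm_amd && echo enabled`` prints
``enabled`` only when nesting is turned on for the AMD module.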


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
Virtualization.  Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slowly,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).
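
To review these settings in one pass, a loop over the parameter files is
enough.  This is a minimal sketch: the function name
``show_kvm_intel_params`` and its optional directory argument are made
up for illustration; only the three parameter names come from the
listing above.  A parameter that does not exist on the running kernel is
reported as missing rather than treated as an error::

    # Hypothetical helper: print the nested-related kvm_intel parameters.
    show_kvm_intel_params() {
        param_dir=${1:-/sys/module/kvm_intel/parameters}
        for p in enable_shadow_vmcs enable_apicv ept; do
            if [ -r "$param_dir/$p" ]; then
                printf '%s=%s\n' "$p" "$(cat "$param_dir/$p")"
            else
                printf '%s=<missing>\n' "$p"
            fi
        done
    }

Calling ``show_kvm_intel_params`` without arguments on L0 reads the live
values from sysfs.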


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest.  Alternatively, for better live migration compatibility, use a
named CPU model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest
with accelerated KVM.
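
One way to confirm, from inside the L1 guest, that hardware
virtualization was actually exposed is to look for the ``vmx`` (Intel)
or ``svm`` (AMD) CPU flag and for the ``/dev/kvm`` device node.  The
helper below is a sketch; its name and its optional arguments
(alternative paths, mainly for testing) are illustrative::

    # Hypothetical helper, meant to be run inside the L1 guest.
    l1_can_run_kvm() {
        cpuinfo=${1:-/proc/cpuinfo}
        kvm_dev=${2:-/dev/kvm}
        # The CPU flag shows that L0 exposed the virtualization extension.
        grep -Eqw 'vmx|svm' "$cpuinfo" || {
            echo "no vmx/svm flag in $cpuinfo" >&2
            return 1
        }
        # The device node shows that the KVM modules are loaded in L1.
        [ -e "$kvm_dev" ] || {
            echo "$kvm_dev not found" >&2
            return 1
        }
    }

If the flag is missing, the L1 guest was most likely started without
``-cpu host`` or without ``vmx=on`` on a named CPU model.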


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter, i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature.  With QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm
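
Because ``nested`` and ``hpage`` cannot both be enabled, it can help to
read the two values back on L0 after loading the module.  The sketch
below assumes that both appear as integer parameters (``0`` or ``1``)
under ``/sys/module/kvm/parameters/``; the function name and the
optional directory argument are hypothetical::

    # Hypothetical helper, meant to be run on the s390x L0 host.
    s390x_nested_ready() {
        param_dir=${1:-/sys/module/kvm/parameters}
        nested=$(cat "$param_dir/nested" 2>/dev/null || true)
        hpage=$(cat "$param_dir/hpage" 2>/dev/null || true)
        echo "nested=${nested:-unknown} hpage=${hpage:-unknown}"
        # Ready only if nesting is on and hpage is off; a value that
        # could not be read is treated as "not ready".
        [ "$nested" = 1 ] && [ "$hpage" = 0 ]
    }

The function prints both values and returns success only for the
combination ``nested=1 hpage=0``.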


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems, but may fail once L2 guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.  To reduce that, keep the following in mind
when reporting a bug:

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while
  thinking they are running nested KVM.  This confuses "nested Virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt, and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1
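
Collecting the same items on each level is easy to script.  The sketch
below gathers part of the list (kernel version, ``/proc/cpuinfo``,
``lscpu`` and ``dmesg``) into one directory per level.  The function
name, the file names, and the directory layout are arbitrary choices,
and the libvirt and QEMU versions are left out because the binary names
differ between distributions.  Run it once on L0 and once on L1::

    # Hypothetical helper: collect_nested_info <level> [base-directory]
    collect_nested_info() {
        level=$1                           # a label such as L0 or L1
        outdir=${2:-nested-report}/$level
        mkdir -p "$outdir" || return 1
        uname -r > "$outdir/kernel-version.txt"
        cat /proc/cpuinfo > "$outdir/cpuinfo.txt" 2>/dev/null || true
        # lscpu may be absent and dmesg may need root; keep going either
        # way, leaving any error message in the corresponding file.
        lscpu > "$outdir/lscpu.txt" 2>&1 || true
        dmesg > "$outdir/dmesg.txt" 2>&1 || true
        echo "$outdir"
    }

For example, ``collect_nested_info L1`` writes its files to
``nested-report/L1`` and prints that path, so the directory can be
archived and attached to the bug report.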

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both of the commands below, ``x86info`` and ``dmidecode``, should be
available under the same name on most Linux distributions:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the generic details mentioned earlier, the following is
also recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0