.. SPDX-License-Identifier: GPL-2.0

=================================
The PPC KVM paravirtual interface
=================================

The basic execution principle by which KVM on PowerPC works is to run all kernel
space code in PR=1, which is user space. This way we trap all privileged
instructions and can emulate them accordingly.

Unfortunately that is also the downfall. There are quite a few privileged
instructions that needlessly return us to the hypervisor even though they
could be handled differently.

This is what the PPC PV interface helps with. It takes privileged instructions
and transforms them into unprivileged ones with some help from the hypervisor.
This cuts down virtualization costs by about 50% on some of my benchmarks.

The code for that interface can be found in arch/powerpc/kernel/kvm*

Querying for existence
======================

To find out if we're running on KVM or not, we leverage the device tree. When
Linux is running on KVM, a node /hypervisor exists. That node contains a
compatible property with the value "linux,kvm".

Once you have determined that you're running under a PV capable KVM, you can
use hypercalls as described below.
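
The check on the compatible property can be sketched in plain C. This is only
an illustration, not the kernel's own code: it assumes the raw property bytes
have already been read from the /hypervisor node, and it relies on the device
tree convention that a compatible property is a list of NUL-terminated strings
packed back to back. The function name ``compat_list_contains`` is
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the raw bytes of a device tree "compatible" property
 * contain the entry `want`. `prop` points at `len` bytes holding one or
 * more NUL-terminated strings. (Hypothetical helper, for illustration.)
 */
bool compat_list_contains(const char *prop, size_t len, const char *want)
{
    size_t want_len = strlen(want);
    size_t pos = 0;

    while (pos < len) {
        /* Find the end of the current entry without reading past `len`. */
        const char *entry = prop + pos;
        const char *end = memchr(entry, '\0', len - pos);
        size_t n = end ? (size_t)(end - entry) : len - pos;

        /* Entries must match exactly, not just share a prefix. */
        if (n == want_len && memcmp(entry, want, n) == 0)
            return true;

        pos += n + 1; /* skip the entry and its terminating NUL */
    }
    return false;
}
```

A guest would call this with the property contents and ``"linux,kvm"`` and
only proceed to the hypercall setup if it returns true.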

KVM hypercalls
==============

Inside the device tree's /hypervisor node there's a property called
'hypercall-instructions'. This property contains at most 4 opcodes that make
up the hypercall. To issue a hypercall, execute these instructions.

The parameters are as follows:

        ======== ================ ================
        Register IN               OUT
        ======== ================ ================
        r0       -                volatile
        r3       1st parameter    Return code
        r4       2nd parameter    1st output value
        r5       3rd parameter    2nd output value
        r6       4th parameter    3rd output value
        r7       5th parameter    4th output value
        r8       6th parameter    5th output value
        r9       7th parameter    6th output value
        r10      8th parameter    7th output value
        r11      hypercall number 8th output value
        r12      -                volatile
        ======== ================ ================

Hypercall definitions are shared in generic code, so the same hypercall numbers
apply for x86 and powerpc alike, with the exception that each KVM hypercall
also needs to be ORed with the KVM vendor code, which is (42 << 16).
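
As a rough sketch of how the value placed in r11 is formed, the following
helpers OR a generic hypercall number with the vendor code given above. The
names ``PV_KVM_VENDOR_ID``, ``pv_hcall_token``, ``pv_hcall_vendor`` and
``pv_hcall_nr`` are made up for this example, and the decoding helpers assume
that the generic number fits in the low 16 bits.

```c
#include <stdint.h>

/* Vendor code from the text above; the macro name is illustrative only. */
#define PV_KVM_VENDOR_ID 42u

/* Build the value that goes into r11 for generic hypercall number `nr`. */
uint32_t pv_hcall_token(uint32_t nr)
{
    return (PV_KVM_VENDOR_ID << 16) | nr;
}

/* Recover the vendor code from a token. */
uint32_t pv_hcall_vendor(uint32_t token)
{
    return token >> 16;
}

/* Recover the generic number, assuming it occupies the low 16 bits. */
uint32_t pv_hcall_nr(uint32_t token)
{
    return token & 0xffffu;
}
```

For instance, generic hypercall number 4 would be passed as ``0x2a0004``.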

Return codes can be as follows:

        ==== =========================
        Code Meaning
        ==== =========================
        0    Success
        12   Hypercall not implemented
        <0   Error
        ==== =========================

The magic page
==============

To enable communication between the hypervisor and guest there is a new shared
page that contains parts of supervisor visible register state. The guest can
map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.

With this hypercall issued the guest always gets the magic page mapped at the
desired location. The first parameter indicates the effective address when the
MMU is enabled. The second parameter indicates the address in real mode, if
applicable to the target. For now, we always map the page to -4096. This way we
can access it using absolute load and store instructions. The following
instruction reads the first field of the magic page::

        ld      rX, -4096(0)

The interface is designed to be extensible should there be need later to add
additional registers to the magic page. If you add fields to the magic page,
also define a new hypercall feature to indicate that the host can give you more
registers. Make use of the additional features only if the host supports them.
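
The reason -4096 allows absolute addressing can be illustrated with a little
arithmetic. With a base register field of 0, a load or store uses a base of
zero, so the effective address is just the sign-extended 16-bit displacement.
The sketch below assumes a 4096-byte page; all names in it are hypothetical
and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAGIC_PAGE_BASE (-4096) /* mapping address stated in the text */

/*
 * Displacement to use with a zero base ("ld rX, disp(0)") to reach the
 * byte at `offset` inside the magic page.
 */
int32_t magic_page_disp(uint32_t offset)
{
    return MAGIC_PAGE_BASE + (int32_t)offset;
}

/* True if `disp` fits the signed 16-bit displacement of a load or store. */
bool fits_d16(int32_t disp)
{
    return disp >= INT16_MIN && disp <= INT16_MAX;
}

/* The page address as a 64-bit guest sees it after sign extension. */
uint64_t magic_page_addr64(void)
{
    return (uint64_t)(int64_t)MAGIC_PAGE_BASE;
}
```

Every offset from 0 to 4095 yields a displacement between -4096 and -1, which
always fits, so any field in the page is reachable without a scratch register.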

The magic page layout is described by struct kvm_vcpu_arch_shared
in arch/powerpc/include/asm/kvm_para.h.

Magic page features
===================

When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
a second return value is passed to the guest. This second return value contains
a bitmap of available features inside the magic page.

The following enhancements to the magic page are currently available:

  ============================  =======================================
  KVM_MAGIC_FEAT_SR             Maps SR registers r/w in the magic page
  KVM_MAGIC_FEAT_MAS0_TO_SPRG7  Maps MASn, ESR, PIR and high SPRGs
  ============================  =======================================

For enhanced features in the magic page, please check for the existence of the
feature before using it!

Magic page flags
================

In addition to features that indicate whether a host is capable of a particular
feature we also have a channel for a guest to tell the host whether it's capable
of something. This is what we call "flags".

Flags are passed to the host in the low 12 bits of the Effective Address.
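
Both directions of this negotiation come down to simple bit operations. The
sketch below assumes the magic page address is 4 KiB aligned, so its low
12 bits are free to carry flags. The helper names are invented for this
example; real feature and flag bit values should be taken from the kernel
headers, not from this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAGIC_FLAGS_MASK 0xfffULL /* low 12 bits, per the text */

/* Guest side: combine a page-aligned effective address with flag bits. */
uint64_t magic_ea_with_flags(uint64_t ea, uint64_t flags)
{
    return (ea & ~MAGIC_FLAGS_MASK) | (flags & MAGIC_FLAGS_MASK);
}

/* Host side: split the parameter back into address and flags. */
uint64_t magic_ea_addr(uint64_t param)
{
    return param & ~MAGIC_FLAGS_MASK;
}

uint64_t magic_ea_flags(uint64_t param)
{
    return param & MAGIC_FLAGS_MASK;
}

/* Guest side: test the returned feature bitmap before using a feature. */
bool magic_has_feature(uint64_t features, uint64_t feature_bit)
{
    return (features & feature_bit) == feature_bit;
}
```

A guest would pass the result of ``magic_ea_with_flags()`` as the effective
address parameter and gate any use of optional fields on
``magic_has_feature()``.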

The following flags are currently available for a guest to expose:

  MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly with respect to the magic page

MSR bits
========

The MSR contains bits that require hypervisor intervention and bits that do
not require direct hypervisor intervention because they only get interpreted
when entering the guest or don't have any impact on the hypervisor's behavior.

The following bits are safe to be set inside the guest:

  - MSR_EE
  - MSR_RI

If any other bit changes in the MSR, please still use mtmsr(d).

Patched instructions
====================

The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
respectively on 32-bit systems, with an added offset of 4 to account for big
endianness.

The following is a list of mappings the Linux kernel performs when running as a
guest. Implementing any of those mappings is optional, as the instruction traps
also act on the shared page. So calling privileged instructions still works as
before.

======================= ================================
From                    To
======================= ================================
mfmsr   rX              ld      rX, magic_page->msr
mfsprg  rX, 0           ld      rX, magic_page->sprg0
mfsprg  rX, 1           ld      rX, magic_page->sprg1
mfsprg  rX, 2           ld      rX, magic_page->sprg2
mfsprg  rX, 3           ld      rX, magic_page->sprg3
mfsrr0  rX              ld      rX, magic_page->srr0
mfsrr1  rX              ld      rX, magic_page->srr1
mfdar   rX              ld      rX, magic_page->dar
mfdsisr rX              lwz     rX, magic_page->dsisr

mtmsr   rX              std     rX, magic_page->msr
mtsprg  0, rX           std     rX, magic_page->sprg0
mtsprg  1, rX           std     rX, magic_page->sprg1
mtsprg  2, rX           std     rX, magic_page->sprg2
mtsprg  3, rX           std     rX, magic_page->sprg3
mtsrr0  rX              std     rX, magic_page->srr0
mtsrr1  rX              std     rX, magic_page->srr1
mtdar   rX              std     rX, magic_page->dar
mtdsisr rX              stw     rX, magic_page->dsisr

tlbsync                 nop

mtmsrd  rX, 0           b       <special mtmsr section>
mtmsr   rX              b       <special mtmsr section>

mtmsrd  rX, 1           b       <special mtmsrd section>

[Book3S only]
mtsrin  rX, rY          b       <special mtsrin section>

[BookE only]
wrteei  [0|1]           b       <special wrteei section>
======================= ================================

Some instructions require more logic to determine what's going on than a load
or store instruction can deliver. To enable patching of those, we keep some
RAM around into which we can translate instructions on the fly. What happens
is the following:

        1) copy emulation code to memory
        2) patch that code to fit the emulated instruction
        3) patch that code to return to the original pc + 4
        4) patch the original instruction to branch to the new code

That way we can inject an arbitrary amount of code as replacement for a single
instruction. This allows us to check for pending interrupts when setting EE=1,
for example.

Hypercall ABIs in KVM on PowerPC
=================================

1) KVM hypercalls (ePAPR)

This is the ePAPR compliant hypercall implementation (mentioned above). Even
generic hypercalls are implemented here, like the ePAPR idle hcall. These are
available on all targets.

2) PAPR hypercalls

PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
them are handled in the kernel, some are handled in user space. This is only
available on book3s_64.

3) OSI hypercalls

Mac-on-Linux is another user of KVM on PowerPC, which has had its own hypercall
since long before KVM. This is supported to maintain compatibility. All these
hypercalls get forwarded to user space. This is only useful on book3s_32, but
can be used with book3s_64 as well.