.. SPDX-License-Identifier: GPL-2.0

=================================
The PPC KVM paravirtual interface
=================================

The basic execution principle by which KVM on PowerPC works is to run all kernel
space code with PR=1, which is user space. This way we trap all privileged
instructions and can emulate them accordingly.

Unfortunately that is also its downfall. There are quite a few privileged
instructions that needlessly return us to the hypervisor even though they
could be handled differently.

This is what the PPC PV interface helps with. It takes privileged instructions
and transforms them into unprivileged ones with some help from the hypervisor.
This cuts down virtualization costs by about 50% on some of my benchmarks.

The code for that interface can be found in arch/powerpc/kernel/kvm*

Querying for existence
======================

To find out if we're running on KVM or not, we leverage the device tree. When
Linux is running on KVM, a node /hypervisor exists. That node contains a
compatible property with the value "linux,kvm".
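
As a minimal sketch of that check, the helper below scans the raw bytes of a
"compatible" property for "linux,kvm". It relies on the standard device tree
encoding of such properties as NUL-terminated strings packed back to back.
The function name and the way the property bytes are obtained are assumptions
made for illustration; in-kernel code would go through the kernel's own
device tree helpers instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if a device tree "compatible" property contains `want`.
 * `prop` points at the raw property value (NUL-terminated strings packed
 * back to back) and `len` is its total length in bytes.
 * Hypothetical helper, not a kernel API.
 */
static bool compatible_contains(const char *prop, size_t len, const char *want)
{
	size_t want_len = strlen(want);
	size_t pos = 0;

	while (pos < len) {
		/* Bound the scan so a missing terminator cannot overrun. */
		const char *end = memchr(prop + pos, '\0', len - pos);
		size_t n = end ? (size_t)(end - (prop + pos)) : len - pos;

		if (n == want_len && memcmp(prop + pos, want, n) == 0)
			return true;
		pos += n + 1;	/* skip the string and its terminator */
	}
	return false;
}
```

A guest would call ``compatible_contains(value, length, "linux,kvm")`` on the
property read from the /hypervisor node and fall back to unpatched, trapping
instructions when it returns false.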

Once you have determined that you're running under a PV-capable KVM, you can
use hypercalls as described below.

KVM hypercalls
==============

Inside the device tree's /hypervisor node there's a property called
'hypercall-instructions'. This property contains at most 4 opcodes that make
up the hypercall. To issue a hypercall, execute these instructions.

The parameters are as follows:

        ========	================	================
	Register	IN			OUT
        ========	================	================
	r0		-			volatile
	r3		1st parameter		Return code
	r4		2nd parameter		1st output value
	r5		3rd parameter		2nd output value
	r6		4th parameter		3rd output value
	r7		5th parameter		4th output value
	r8		6th parameter		5th output value
	r9		7th parameter		6th output value
	r10		8th parameter		7th output value
	r11		hypercall number	8th output value
	r12		-			volatile
        ========	================	================

Hypercall definitions are shared in generic code, so the same hypercall numbers
apply to x86 and powerpc alike, with the exception that each KVM hypercall
number also needs to be ORed with the KVM vendor code, which is (42 << 16).
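
The register convention and the vendor code can be modelled in plain C as
follows. This is only a sketch of the data layout: ``struct hcall_frame``,
``kvm_hcall_token()`` and ``hcall_prepare()`` are names invented for this
example, and actually issuing the hypercall still requires loading r3-r11 and
executing the opcodes from 'hypercall-instructions'.

```c
#include <stddef.h>

#define KVM_VENDOR_CODE	42UL	/* vendor code quoted in the text */

/* Register image for one hypercall, following the parameter table. */
struct hcall_frame {
	/* in: parameters 1-8; out: r3 = return code, r4-r10 = outputs 1-7 */
	unsigned long r3_r10[8];
	/* in: hypercall number; out: 8th output value */
	unsigned long r11;
};

/* OR the generic hypercall number with the KVM vendor code. */
static unsigned long kvm_hcall_token(unsigned long nr)
{
	return (KVM_VENDOR_CODE << 16) | nr;
}

/* Fill a frame for hypercall `nr`; unused parameter registers are zeroed. */
static void hcall_prepare(struct hcall_frame *f, unsigned long nr,
			  const unsigned long *args, size_t nargs)
{
	size_t i;

	for (i = 0; i < 8; i++)
		f->r3_r10[i] = (i < nargs) ? args[i] : 0;
	f->r11 = kvm_hcall_token(nr);
}
```

For a generic hypercall number of 4, for example, the value placed in r11 is
``(42 << 16) | 4``, that is 0x2a0004.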

Return codes can be as follows:

	====		=========================
	Code		Meaning
	====		=========================
	0		Success
	12		Hypercall not implemented
	<0		Error
	====		=========================
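
A caller might fold the table above into a small classifier like the one
below. The enum and function names are made up for this sketch, and treating
r3 as a signed value is an assumption that follows from the "<0" row. Positive
values other than 12 are not covered by the table, so the sketch reports them
separately instead of guessing.

```c
/* Outcome classes taken from the return code table. */
enum hcall_status {
	HCALL_OK,		/* 0 */
	HCALL_UNIMPLEMENTED,	/* 12 */
	HCALL_ERROR,		/* any negative value */
	HCALL_OTHER,		/* positive value not listed in the table */
};

/* Classify the value returned in r3, interpreted as a signed integer. */
static enum hcall_status hcall_classify(long rc)
{
	if (rc == 0)
		return HCALL_OK;
	if (rc == 12)
		return HCALL_UNIMPLEMENTED;
	if (rc < 0)
		return HCALL_ERROR;
	return HCALL_OTHER;
}
```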

The magic page
==============

To enable communication between the hypervisor and guest there is a new shared
page that contains parts of supervisor visible register state. The guest can
map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.

Once this hypercall has been issued, the guest always gets the magic page
mapped at the desired location. The first parameter indicates the effective
address when the MMU is enabled. The second parameter indicates the address in
real mode, if applicable to the target. For now, we always map the page to
-4096. This way we can access it using absolute load and store instructions.
The following instruction reads the first field of the magic page::

	ld	rX, -4096(0)
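
The reason -4096 works without a base register is that a load or store with a
base register field of 0 uses the constant zero as its base, so the effective
address is just the sign-extended 16-bit displacement. The sketch below models
that address calculation for a 64-bit guest; both function names are
illustrative only.

```c
#include <stdint.h>

/*
 * Effective address of a load or store whose base register field is 0:
 * the base is the constant zero, so the address is the sign-extended
 * 16-bit displacement.
 */
static uint64_t ea_with_zero_base(int16_t disp)
{
	return (uint64_t)(int64_t)disp;
}

/* True if a 64-bit address is reachable through the displacement alone. */
static int reachable_without_base(uint64_t addr)
{
	return addr < 0x8000ULL || addr >= 0xFFFFFFFFFFFF8000ULL;
}
```

With a displacement of -4096 the effective address is the start of the last
4 KiB page of the address space, which is why the whole magic page is
reachable with displacements from -4096 to -1.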

The interface is designed to be extensible should there be a need later to add
additional registers to the magic page. If you add fields to the magic page,
also define a new hypercall feature to indicate that the host can give you more
registers. Make use of the additional features only if the host supports them.

The magic page layout is described by struct kvm_vcpu_arch_shared
in arch/powerpc/include/asm/kvm_para.h.

Magic page features
===================

When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
a second return value is passed to the guest. This second return value contains
a bitmap of available features inside the magic page.

The following enhancements to the magic page are currently available:

  ============================  =======================================
  KVM_MAGIC_FEAT_SR		Maps SR registers r/w in the magic page
  KVM_MAGIC_FEAT_MAS0_TO_SPRG7	Maps MASn, ESR, PIR and high SPRGs
  ============================  =======================================

For enhanced features in the magic page, please check for the existence of the
feature before using it!
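
Checking the feature bitmap comes down to a mask test, sketched below. The
bit positions given for the two feature macros are assumptions made for this
example and should be taken from the kernel's kvm_para.h header in real code;
``magic_page_has()`` is a hypothetical helper.

```c
#include <stdbool.h>

/* Assumed bit assignments; use the definitions from kvm_para.h in real code. */
#define KVM_MAGIC_FEAT_SR		(1UL << 0)
#define KVM_MAGIC_FEAT_MAS0_TO_SPRG7	(1UL << 1)

/* True if every bit of `feat` is set in the bitmap returned by the host. */
static bool magic_page_has(unsigned long features, unsigned long feat)
{
	return (features & feat) == feat;
}
```

A guest would keep the bitmap returned by KVM_HC_PPC_MAP_MAGIC_PAGE and only
patch instructions that touch the extended fields when the matching test
succeeds.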

Magic page flags
================

In addition to features that indicate whether a host is capable of a particular
feature we also have a channel for a guest to tell the host whether it's capable
of something. This is what we call "flags".

Flags are passed to the host in the low 12 bits of the Effective Address.

The following flags are currently available for a guest to expose:

  MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page
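
Because the magic page is page aligned, the low 12 bits of its effective
address are free to carry flags. The sketch below packs and unpacks such an
argument. The bit position used for MAGIC_PAGE_FLAG_NOT_MAPPED_NX is an
assumption, and the helper names are invented for illustration.

```c
#define MAGIC_PAGE_FLAGS_MASK		0xfffUL		/* low 12 bits of the EA */
#define MAGIC_PAGE_FLAG_NOT_MAPPED_NX	(1UL << 0)	/* assumed bit position */

/* Combine a page-aligned effective address with guest flags. */
static unsigned long magic_page_arg(unsigned long ea, unsigned long flags)
{
	return (ea & ~MAGIC_PAGE_FLAGS_MASK) | (flags & MAGIC_PAGE_FLAGS_MASK);
}

/* Host-side view: recover the address part of the argument. */
static unsigned long magic_page_arg_ea(unsigned long arg)
{
	return arg & ~MAGIC_PAGE_FLAGS_MASK;
}

/* Host-side view: recover the flags part of the argument. */
static unsigned long magic_page_arg_flags(unsigned long arg)
{
	return arg & MAGIC_PAGE_FLAGS_MASK;
}
```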

MSR bits
========

The MSR contains bits that require hypervisor intervention and bits that do
not require direct hypervisor intervention because they only get interpreted
when entering the guest or don't have any impact on the hypervisor's behavior.

The following bits are safe to be set inside the guest:

  - MSR_EE
  - MSR_RI

If any other bit changes in the MSR, please still use mtmsr(d).
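
The rule above can be expressed as a mask comparison between the old and the
new MSR value. The sketch below uses the architectural bit positions of EE and
RI; the function name is hypothetical and the code is an illustration of the
rule, not the kernel's implementation.

```c
#include <stdbool.h>

#define MSR_EE		(1UL << 15)	/* external interrupt enable */
#define MSR_RI		(1UL << 1)	/* recoverable interrupt */
#define MSR_SAFE_BITS	(MSR_EE | MSR_RI)

/*
 * True if the update changes any bit outside the safe set, in which case
 * the guest still has to execute a real (trapping) mtmsr(d).
 */
static bool msr_update_needs_trap(unsigned long old_msr, unsigned long new_msr)
{
	return ((old_msr ^ new_msr) & ~MSR_SAFE_BITS) != 0;
}
```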

Patched instructions
====================

The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
respectively on 32-bit systems, with an added offset of 4 to account for big
endianness.
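
The offset of 4 follows from the byte order: in a big-endian 64-bit field the
most significant word comes first, so the low 32 bits that a 32-bit guest
cares about sit at byte offset 4. The helpers below are illustrative names
that demonstrate this on a plain byte buffer.

```c
#include <stdint.h>

/* Store a 64-bit value in big-endian byte order. */
static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load a 32-bit value stored in big-endian byte order. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Reading a 32-bit word at offset 0 of such a field returns the upper half,
while reading at offset 4 returns the lower half, which is what the
"lwz"/"stw" replacements rely on.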

The following is a list of mappings the Linux kernel performs when running as
a guest. Implementing any of those mappings is optional, as the instruction
traps also act on the shared page. So calling privileged instructions still
works as before.

======================= ================================
From			To
======================= ================================
mfmsr	rX		ld	rX, magic_page->msr
mfsprg	rX, 0		ld	rX, magic_page->sprg0
mfsprg	rX, 1		ld	rX, magic_page->sprg1
mfsprg	rX, 2		ld	rX, magic_page->sprg2
mfsprg	rX, 3		ld	rX, magic_page->sprg3
mfsrr0	rX		ld	rX, magic_page->srr0
mfsrr1	rX		ld	rX, magic_page->srr1
mfdar	rX		ld	rX, magic_page->dar
mfdsisr	rX		lwz	rX, magic_page->dsisr

mtmsr	rX		std	rX, magic_page->msr
mtsprg	0, rX		std	rX, magic_page->sprg0
mtsprg	1, rX		std	rX, magic_page->sprg1
mtsprg	2, rX		std	rX, magic_page->sprg2
mtsprg	3, rX		std	rX, magic_page->sprg3
mtsrr0	rX		std	rX, magic_page->srr0
mtsrr1	rX		std	rX, magic_page->srr1
mtdar	rX		std	rX, magic_page->dar
mtdsisr	rX		stw	rX, magic_page->dsisr

tlbsync			nop

mtmsrd	rX, 0		b	<special mtmsr section>
mtmsr	rX		b	<special mtmsr section>

mtmsrd	rX, 1		b	<special mtmsrd section>

[Book3S only]
mtsrin	rX, rY		b	<special mtsrin section>

[BookE only]
wrteei	[0|1]		b	<special wrteei section>
======================= ================================
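
For the simple rows of the table, the replacement instruction can be built by
keeping the target register field of the original instruction and combining
it with the load opcode and the low 16 bits of the field's address. The sketch
below shows this for the "mfXXX rX" to "ld"/"lwz" case. The macro and function
names are invented for this example, and the address passed in is assumed to
be -4096 plus the offset of the field inside the magic page.

```c
#include <stdint.h>

/* Primary opcodes with all operand fields zero. */
#define INST_LD		0xe8000000u	/* ld  RT, DS(RA) */
#define INST_LWZ	0x80000000u	/* lwz RT, D(RA) */
#define INST_RT_MASK	0x03e00000u	/* RT field of the instruction word */

/* 64-bit guest: turn "mfXXX rX" into "ld rX, addr(0)". */
static uint32_t patch_to_ld(uint32_t orig, long addr)
{
	return INST_LD | (orig & INST_RT_MASK) | ((uint32_t)addr & 0xfffcu);
}

/* 32-bit guest: "lwz rX, addr+4(0)" reads the low word of the field. */
static uint32_t patch_to_lwz(uint32_t orig, long addr)
{
	return INST_LWZ | (orig & INST_RT_MASK) |
	       ((uint32_t)(addr + 4) & 0xfffcu);
}
```

Applied to ``mfmsr r5`` (0x7ca000a6) with an address of -4096, the first
helper yields 0xe8a0f000, which is the encoding of ``ld r5, -4096(0)``.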

Some instructions require more logic to determine what's going on than a load
or store instruction can deliver. To enable patching of those, we keep some
RAM around into which we can translate instructions live. What happens is the
following:

	1) copy emulation code to memory
	2) patch that code to fit the emulated instruction
	3) patch that code to return to the original pc + 4
	4) patch the original instruction to branch to the new code

That way we can inject an arbitrary amount of code as replacement for a single
instruction. This allows us to check for pending interrupts when setting EE=1
for example.
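
The four steps can be sketched on a flat array of instruction words. This is
a simplified model under several assumptions: memory is an array starting at
byte address 0, "fitting" the template means ORing the original target
register field into one template word, the last template word is reserved for
the return branch, and the range check that a real implementation needs for
the 26-bit branch displacement is omitted. All names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define INST_B		0x48000000u	/* b target (relative, no link) */
#define INST_B_MASK	0x03fffffcu	/* word-aligned displacement field */
#define INST_RT_MASK	0x03e00000u	/* register field of the original */

/* Encode a relative branch from byte address `from` to `to`. */
static uint32_t ppc_branch(uint32_t from, uint32_t to)
{
	return INST_B | ((to - from) & INST_B_MASK);
}

/*
 * Replace the instruction at byte address `pc` with a branch to a copy of
 * `tmpl` (n words) placed at byte address `scratch`. Word `reg_slot` of
 * the template receives the register field of the original instruction.
 */
static void patch_with_trampoline(uint32_t *image, uint32_t pc,
				  uint32_t scratch, const uint32_t *tmpl,
				  uint32_t n, uint32_t reg_slot)
{
	uint32_t *code = &image[scratch / 4];
	uint32_t ret_at = scratch + 4 * (n - 1);

	memcpy(code, tmpl, n * sizeof(*tmpl));		/* 1) copy the code */
	code[reg_slot] |= image[pc / 4] & INST_RT_MASK;	/* 2) fit it */
	code[n - 1] = ppc_branch(ret_at, pc + 4);	/* 3) return branch */
	image[pc / 4] = ppc_branch(pc, scratch);	/* 4) redirect original */
}
```

On real hardware the patched words would also have to be flushed from the
data cache and invalidated in the instruction cache before they are executed.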

Hypercall ABIs in KVM on PowerPC
=================================

1) KVM hypercalls (ePAPR)

This is the ePAPR-compliant hypercall implementation (mentioned above). Even
generic hypercalls are implemented here, like the ePAPR idle hcall. These are
available on all targets.

2) PAPR hypercalls

PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
them are handled in the kernel, some are handled in user space. This is only
available on book3s_64.

3) OSI hypercalls

Mac-on-Linux is another user of KVM on PowerPC, which has had its own hypercall
interface since long before KVM. This is supported to maintain compatibility.
All these hypercalls get forwarded to user space. This is only useful on
book3s_32, but can be used with book3s_64 as well.