==================================
VFIO - "Virtual Function I/O" [1]_
==================================

Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment.  In other words, this allows
safe [2]_, non-privileged, userspace drivers.

Why do we want that?  Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance.  From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers [3]_.

Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace.  Examples include network adapters (often non-TCP/IP based)
and compute accelerators.  Prior to VFIO, these drivers had to either
go through the full development cycle to become a proper upstream
driver, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.

The VFIO driver framework intends to unify these, replacing the KVM
PCI specific device assignment code and providing a more secure, more
featureful userspace driver environment than UIO.

Groups, Devices, and IOMMUs
---------------------------

Devices are the main target of any I/O driver.  Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA.  Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment
as allowing a device read-write access to system memory imposes the
greatest risk to the overall system integrity.

To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (ie. solving the addressing problems of devices
with limited address spaces).  With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.

This isolation is not always at the granularity of a single device
though.  Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger
multi-function enclosure.  While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU.  Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU.  Topology
can also play a factor in terms of hiding devices.  A PCIe-to-PCI
bridge masks the devices behind it, making transactions appear as if
from the bridge itself.  Obviously IOMMU design plays a major factor
as well.

Therefore, while for the most part an IOMMU may have device level
granularity, any system is susceptible to reduced granularity.  The
IOMMU API therefore supports a notion of IOMMU groups.  A group is
a set of devices which is isolatable from all other devices in the
system.  Groups are therefore the unit of ownership used by VFIO.

While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity.  In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables), and to the user (programming only
a single set of translations).  For this reason, VFIO makes use of
a container class, which may hold one or more groups.  A container
is created by simply opening the /dev/vfio/vfio character device.

On its own, the container provides little functionality, with all
but a couple of version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality.  To do this, the user first needs to identify the
group associated with the desired device.  This can be done using
the sysfs links described in the example below.  By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device).  TBD - interface
for disabling driver probing/locking a device.

Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file.  If desired and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container.  If a group fails to set to a container
with existing groups, a new empty container will need to be used
instead.

With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.

The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.
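
As a rough illustration of how a region description might be consumed,
the sketch below classifies a region from the size and flags fields of
struct vfio_region_info.  The helper region_access_mode() and its policy
are illustrative assumptions, not part of the VFIO API, and treating a
zero-size region as unimplemented is likewise an assumption.  The flag
definitions are fallback copies of the values in
include/uapi/linux/vfio.h, used only if that header was not included.

```c
#include <stdint.h>

/* Fallback copies of the uapi flag bits (see the note above). */
#ifndef VFIO_REGION_INFO_FLAG_READ
#define VFIO_REGION_INFO_FLAG_READ	(1 << 0)
#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1)
#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2)
#endif

enum region_access {
	REGION_SKIP,	/* nothing usable in this region */
	REGION_RW_ONLY,	/* use read()/write() at the region offset */
	REGION_MMAP,	/* region may be mmap()ed at the region offset */
};

/* Hypothetical policy: prefer mmap, fall back to read/write. */
enum region_access region_access_mode(uint64_t size, uint32_t flags)
{
	if (size == 0)
		return REGION_SKIP;	/* assumed: region not implemented */

	if (flags & VFIO_REGION_INFO_FLAG_MMAP)
		return REGION_MMAP;

	if (flags & (VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE))
		return REGION_RW_ONLY;

	return REGION_SKIP;
}
```

A caller would pass the size and flags fields filled in by
VFIO_DEVICE_GET_REGION_INFO and use the offset field of the same
structure for the subsequent read, write, or mmap on the device
descriptor.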

VFIO Usage Example
------------------

Assume the user wants to access PCI device 0000:06:0d.0::

	$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
	../../../../kernel/iommu_groups/26

This device is therefore in IOMMU group 26.  This device is on the
PCI bus, therefore the user will make use of vfio-pci to manage the
group::

	# modprobe vfio-pci

Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group::

	$ lspci -n -s 0000:06:0d.0
	06:0d.0 0401: 1102:0002 (rev 08)
	# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
	# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

Now we need to look at what other devices are in the group to free
it for use by VFIO::

	$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
	total 0
	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0
	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above.  Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).

The final step is to provide the user with access to the group if
unprivileged operation is desired (note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system)::

	# chown user:user /dev/vfio/26

The user now has full access to all the devices and the IOMMU for this
group and can access them as follows::

	int container, group, device, i;
	struct vfio_group_status group_status =
					{ .argsz = sizeof(group_status) };
	struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
	struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

	/* Create a new container */
	container = open("/dev/vfio/vfio", O_RDWR);

	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		/* Unknown API version */

	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
		/* Doesn't support the IOMMU driver we want. */

	/* Open the group */
	group = open("/dev/vfio/26", O_RDWR);

	/* Test the group is viable and available */
	ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
		/* Group is not viable (ie, not all devices bound for vfio) */

	/* Add the group to the container */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Enable the IOMMU model we want */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Get additional IOMMU info */
	ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

	/* Allocate some space and setup a DMA mapping */
	dma_map.vaddr = (uint64_t)(uintptr_t)mmap(0, 1024 * 1024,
						  PROT_READ | PROT_WRITE,
						  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dma_map.size = 1024 * 1024;
	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* Get a file descriptor for the device */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

	/* Test and setup the device */
	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

	for (i = 0; i < device_info.num_regions; i++) {
		struct vfio_region_info reg = { .argsz = sizeof(reg) };

		reg.index = i;

		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

		/* Setup mappings... read/write offsets, mmaps
		 * For PCI devices, config space is a region */
	}

	for (i = 0; i < device_info.num_irqs; i++) {
		struct vfio_irq_info irq = { .argsz = sizeof(irq) };

		irq.index = i;

		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
	}

	/* Gratuitous device reset and go... */
	ioctl(device, VFIO_DEVICE_RESET);

IOMMUFD and vfio_iommu_type1
----------------------------

IOMMUFD is the new user API to manage I/O page tables from userspace.
It intends to be the portal of delivering advanced userspace DMA
features (nested translation [5]_, PASID [6]_, etc.) while also providing
a backwards compatibility interface for existing VFIO_TYPE1v2_IOMMU use
cases.  Eventually the vfio_iommu_type1 driver, as well as the legacy
vfio container and group model, are intended to be deprecated.

The IOMMUFD backwards compatibility interface can be enabled in two ways.
In the first method, the kernel can be configured with
CONFIG_IOMMUFD_VFIO_CONTAINER, in which case the IOMMUFD subsystem
transparently provides the entire infrastructure for the VFIO
container and IOMMU backend interfaces.  The compatibility mode can
also be accessed if the VFIO container interface, ie. /dev/vfio/vfio, is
simply symlink'd to /dev/iommu.  Note that at the time of writing, the
compatibility mode is not entirely feature complete relative to
VFIO_TYPE1v2_IOMMU (ex. DMA mapping MMIO) and does not attempt to
provide compatibility to the VFIO_SPAPR_TCE_IOMMU interface.  Therefore
it is not generally advisable at this time to switch from native VFIO
implementations to the IOMMUFD compatibility interfaces.

Long term, VFIO users should migrate to device access through the cdev
interface described below, and native access through the IOMMUFD
provided interfaces.

VFIO Device cdev
----------------

Traditionally, the user acquires a device fd via VFIO_GROUP_GET_DEVICE_FD
in a VFIO group.

With CONFIG_VFIO_DEVICE_CDEV=y the user can now acquire a device fd
by directly opening a character device /dev/vfio/devices/vfioX where
"X" is the number allocated uniquely by VFIO for registered devices.
The cdev interface does not support noiommu devices, so the user should
use the legacy group interface if noiommu is wanted.
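
The naming scheme above could be handled with a small helper such as
the following sketch.  The function name vfio_cdev_path() is
hypothetical; it only validates a name of the form "vfioX" (as listed
under the device's vfio-dev sysfs directory) and builds the matching
character device path, without opening anything.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/dev/vfio/devices/<name>" for a name of the form "vfioX",
 * where X is a decimal number.  Returns 0 on success, -1 if the name
 * is malformed or the output buffer is too small.
 */
int vfio_cdev_path(const char *name, char *out, size_t outlen)
{
	const char *p;
	int n;

	/* Require the "vfio" prefix followed by at least one digit */
	if (strncmp(name, "vfio", 4) != 0 || name[4] == '\0')
		return -1;

	for (p = name + 4; *p; p++)
		if (!isdigit((unsigned char)*p))
			return -1;

	n = snprintf(out, outlen, "/dev/vfio/devices/%s", name);
	if (n < 0 || (size_t)n >= outlen)
		return -1;	/* output would have been truncated */

	return 0;
}
```

Rejecting anything other than "vfio" plus digits keeps a name taken from
sysfs from turning into an unexpected path before it is passed to
open().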

The cdev only works with IOMMUFD.  Both VFIO drivers and applications
must adapt to the new cdev security model which requires using
VFIO_DEVICE_BIND_IOMMUFD to claim DMA ownership before starting to
actually use the device.  Once BIND succeeds then a VFIO device can
be fully accessed by the user.

VFIO device cdev doesn't rely on VFIO group/container/iommu drivers.
Hence those modules can be fully compiled out in an environment
where no legacy VFIO application exists.

SPAPR does not yet support IOMMUFD, so it cannot support device cdev
either.

VFIO device cdev access is still bound by IOMMU group semantics, ie. there
can be only one DMA owner for the group.  Devices belonging to the same
group cannot be bound to multiple iommufd_ctx or shared between native
kernel and vfio bus driver or other driver supporting the driver_managed_dma
flag.  A violation of this ownership requirement will fail at the
VFIO_DEVICE_BIND_IOMMUFD ioctl, which gates full device access.

Device cdev Example
-------------------

Assume the user wants to access PCI device 0000:6a:01.0::

	$ ls /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/
	vfio0

This device is therefore represented as vfio0.  The user can verify
its existence::

	$ ls -l /dev/vfio/devices/vfio0
	crw------- 1 root root 511, 0 Feb 16 01:22 /dev/vfio/devices/vfio0
	$ cat /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/vfio0/dev
	511:0
	$ ls -l /dev/char/511\:0
	lrwxrwxrwx 1 root root 21 Feb 16 01:22 /dev/char/511:0 -> ../vfio/devices/vfio0

Then provide the user with access to the device if unprivileged
operation is desired::

	# chown user:user /dev/vfio/devices/vfio0

Finally, the user can obtain the cdev fd with::

	cdev_fd = open("/dev/vfio/devices/vfio0", O_RDWR);

An opened cdev_fd doesn't give the user any permission to access the
device other than binding the cdev_fd to an iommufd.  After that point
the device is fully accessible, including attaching it to an
IOMMUFD IOAS/HWPT to enable userspace DMA::

	struct vfio_device_bind_iommufd bind = {
		.argsz = sizeof(bind),
		.flags = 0,
	};
	struct iommu_ioas_alloc alloc_data  = {
		.size = sizeof(alloc_data),
		.flags = 0,
	};
	struct vfio_device_attach_iommufd_pt attach_data = {
		.argsz = sizeof(attach_data),
		.flags = 0,
	};
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE |
			 IOMMU_IOAS_MAP_FIXED_IOVA,
		.__reserved = 0,
	};

	iommufd = open("/dev/iommu", O_RDWR);

	bind.iommufd = iommufd;
	ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);

	ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
	attach_data.pt_id = alloc_data.out_ioas_id;
	ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);

	/* Allocate some space and setup a DMA mapping */
	map.user_va = (uint64_t)(uintptr_t)mmap(0, 1024 * 1024,
						PROT_READ | PROT_WRITE,
						MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	map.iova = 0; /* 1MB starting at 0x0 from device view */
	map.length = 1024 * 1024;
	map.ioas_id = alloc_data.out_ioas_id;

	ioctl(iommufd, IOMMU_IOAS_MAP, &map);

	/* Other device operations as stated in "VFIO Usage Example" */

VFIO User API
-------------

Please see include/uapi/linux/vfio.h for complete API documentation.

VFIO bus driver API
-------------------

VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
into VFIO core.  The following interfaces are called when devices are
bound to and unbound from the driver::

	int vfio_register_group_dev(struct vfio_device *device);
	int vfio_register_emulated_iommu_dev(struct vfio_device *device);
	void vfio_unregister_group_dev(struct vfio_device *device);

The driver should embed the vfio_device in its own structure and use
vfio_alloc_device() to allocate the structure, and can register
@init/@release callbacks to manage any private state wrapping the
vfio_device::

	vfio_alloc_device(dev_struct, member, dev, ops);
	void vfio_put_device(struct vfio_device *device);

vfio_register_group_dev() indicates to the core to begin tracking the
iommu_group of the specified dev and register the dev as owned by a VFIO bus
driver. Once vfio_register_group_dev() returns it is possible for userspace to
start accessing the driver, thus the driver should ensure it is completely
ready before calling it. The driver provides an ops structure for callbacks
similar to a file operations structure::

	struct vfio_device_ops {
		char	*name;
		int	(*init)(struct vfio_device *vdev);
		void	(*release)(struct vfio_device *vdev);
		int	(*bind_iommufd)(struct vfio_device *vdev,
					struct iommufd_ctx *ictx, u32 *out_device_id);
		void	(*unbind_iommufd)(struct vfio_device *vdev);
		int	(*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
		void	(*detach_ioas)(struct vfio_device *vdev);
		int	(*open_device)(struct vfio_device *vdev);
		void	(*close_device)(struct vfio_device *vdev);
		ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
				size_t count, loff_t *ppos);
		ssize_t	(*write)(struct vfio_device *vdev, const char __user *buf,
				 size_t count, loff_t *size);
		long	(*ioctl)(struct vfio_device *vdev, unsigned int cmd,
				 unsigned long arg);
		int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
		void	(*request)(struct vfio_device *vdev, unsigned int count);
		int	(*match)(struct vfio_device *vdev, char *buf);
		void	(*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
		int	(*device_feature)(struct vfio_device *device, u32 flags,
					  void __user *arg, size_t argsz);
	};

Each function is passed the vdev that was originally registered
in the vfio_register_group_dev() or vfio_register_emulated_iommu_dev()
call above. This allows the bus driver to obtain its private data using
container_of().
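
The embedding pattern can be illustrated outside the kernel with the
sketch below.  Everything in it is a stand-in: the container_of() macro
is a simplified userspace version, struct vfio_device is reduced to a
placeholder, and struct my_vfio_dev, to_my_dev() and my_open_device()
are hypothetical names for a driver's private state and callback.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Placeholder for the kernel structure; only the embedding matters. */
struct vfio_device {
	int placeholder;
};

/* Hypothetical bus driver state wrapping the vfio_device. */
struct my_vfio_dev {
	int open_count;
	struct vfio_device vdev;
};

/* Recover the private structure from the embedded vfio_device. */
static inline struct my_vfio_dev *to_my_dev(struct vfio_device *vdev)
{
	return container_of(vdev, struct my_vfio_dev, vdev);
}

/* Shaped like the open_device callback: only vdev is passed in. */
int my_open_device(struct vfio_device *vdev)
{
	struct my_vfio_dev *mdev = to_my_dev(vdev);

	mdev->open_count++;
	return 0;
}
```

Because the vfio_device is embedded rather than pointed to, no separate
private-data pointer is needed: the offset of the member within the
wrapping structure is enough to get back to the driver's state.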

::

	- The init/release callbacks are issued when vfio_device is initialized
	  and released.

	- The open/close device callbacks are issued when the first
	  instance of a file descriptor for the device is created (eg.
	  via VFIO_GROUP_GET_DEVICE_FD) for a user session.

	- The ioctl callback provides a direct pass through for some VFIO_DEVICE_*
	  ioctls.

	- The [un]bind_iommufd callbacks are issued when the device is bound to
	  and unbound from iommufd.

	- The [de]attach_ioas callbacks are issued when the device is attached to
	  and detached from an IOAS managed by the bound iommufd. However, the
	  attached IOAS can also be automatically detached when the device is
	  unbound from iommufd.

	- The read/write/mmap callbacks implement the device region access defined
	  by the device's own VFIO_DEVICE_GET_REGION_INFO ioctl.

	- The request callback is issued when the device is going to be
	  unregistered, such as when trying to unbind the device from the vfio
	  bus driver.

	- The dma_unmap callback is issued when a range of iovas is unmapped
	  in the container or IOAS attached by the device. Drivers which make
	  use of the vfio page pinning interface must implement this callback in
	  order to unpin pages within the dma_unmap range. Drivers must tolerate
	  this callback even before calls to open_device().
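
The bookkeeping a dma_unmap implementation needs can be sketched in
plain C as below.  This is a userspace illustration under stated
assumptions: struct pinned_range and my_dma_unmap() are hypothetical,
the driver is assumed to track its pinned ranges in a simple array, and
the actual unpinning is reduced to clearing a flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one range a driver has pinned. */
struct pinned_range {
	uint64_t iova;
	uint64_t length;
	bool pinned;
};

/* True if [a, a + alen) and [b, b + blen) intersect; lengths nonzero. */
static bool ranges_overlap(uint64_t a, uint64_t alen,
			   uint64_t b, uint64_t blen)
{
	/* Compare last bytes so a range ending at 2^64 - 1 still works */
	return a <= b + (blen - 1) && b <= a + (alen - 1);
}

/*
 * Shaped like the dma_unmap callback: release every pinned range that
 * intersects the unmapped IOVA range.  Returns how many were released.
 */
size_t my_dma_unmap(struct pinned_range *r, size_t n,
		    uint64_t iova, uint64_t length)
{
	size_t released = 0;

	if (length == 0)
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (!r[i].pinned || r[i].length == 0)
			continue;

		if (ranges_overlap(r[i].iova, r[i].length, iova, length)) {
			/* A real driver would unpin the pages here */
			r[i].pinned = false;
			released++;
		}
	}

	return released;
}
```

Since the callback may arrive before open_device(), the sketch treats an
empty or fully released table as a normal case and simply returns zero.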
46662306a36Sopenharmony_ci
46762306a36Sopenharmony_ciPPC64 sPAPR implementation note
46862306a36Sopenharmony_ci-------------------------------
46962306a36Sopenharmony_ci
This implementation has some specifics:

1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   container is supported, because an IOMMU table is allocated at boot time,
   one table per IOMMU group, which is a Partitionable Endpoint (PE)
   (a PE is often a PCI domain, but not always).

   Newer systems (POWER8 with IODA2) have an improved hardware design that
   removes this limitation and allows multiple IOMMU groups per VFIO
   container.

2) The hardware supports so-called DMA windows: the PCI address range
   within which DMA transfer is allowed. Any attempt to access address space
   outside the window causes the whole PE to be isolated.

3) PPC64 guests are paravirtualized but not fully emulated. There is an API
   to map/unmap pages for DMA; it normally maps 1..32 pages per call, and
   currently there is no way to reduce the number of calls. To make things
   faster, the map/unmap handling has been implemented in real mode, which
   provides excellent performance but has limitations, such as the
   inability to do locked pages accounting in real time.

4) According to the sPAPR specification, a Partitionable Endpoint (PE) is an
   I/O subtree that can be treated as a unit for the purposes of partitioning
   and error recovery. A PE may be a single or multi-function IOA (IO
   Adapter), a function of a multi-function IOA, or multiple IOAs (possibly
   including switch and bridge structures above the multiple IOAs). PPC64
   guests detect PCI errors and recover from them via EEH RTAS services,
   which work on the basis of additional ioctl commands.

   So 4 additional ioctls have been added:

	VFIO_IOMMU_SPAPR_TCE_GET_INFO
		returns the size and the start of the DMA window on the PCI bus.

	VFIO_IOMMU_ENABLE
		enables the container. The locked pages accounting
		is done at this point. This lets the user first learn what
		the DMA window is and adjust the rlimit before doing any
		real work.

	VFIO_IOMMU_DISABLE
		disables the container.

	VFIO_EEH_PE_OP
		provides an API for EEH setup, error detection and recovery.

   The code flow from the example above should be slightly changed::

	struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };

	.....
	/* Add the group to the container */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Enable the IOMMU model we want */
	ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);

	/* Get additional sPAPR IOMMU info */
	struct vfio_iommu_spapr_tce_info spapr_iommu_info = {
		.argsz = sizeof(spapr_iommu_info)
	};
	ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);

	if (ioctl(container, VFIO_IOMMU_ENABLE)) {
		/* Cannot enable container, may be low rlimit */
	}

	/* Allocate some space and setup a DMA mapping */
	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

	dma_map.size = 1024 * 1024;
	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	/* Check here that .iova/.size are within the DMA window from
	 * spapr_iommu_info.
	 */
	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* Get a file descriptor for the device */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

	....

	/* Gratuitous device reset and go... */
	ioctl(device, VFIO_DEVICE_RESET);

	/* Make sure EEH is supported */
	ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);

	/* Enable the EEH functionality on the device */
	pe_op.op = VFIO_EEH_PE_ENABLE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* It is suggested to create an additional data structure to
	 * represent the PE, and to put child devices belonging to the same
	 * IOMMU group into the PE instance for later reference.
	 */

	/* Check the PE's state and make sure it's in a functional state */
	pe_op.op = VFIO_EEH_PE_GET_STATE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Save device state using pci_save_state().
	 * EEH should be enabled on the specified device.
	 */

	....

	/* Inject EEH error, which is expected to be caused by a 32-bit
	 * config load.
	 */
	pe_op.op = VFIO_EEH_PE_INJECT_ERR;
	pe_op.err.type = EEH_ERR_TYPE_32;
	pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR;
	pe_op.err.addr = 0ul;
	pe_op.err.mask = 0ul;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	....

	/* When 0xFF's are returned from reading PCI config space or IO BARs
	 * of the PCI device, check the PE's state to see if it has been
	 * frozen.
	 */
	pe_op.op = VFIO_EEH_PE_GET_STATE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Wait for pending PCI transactions to complete and don't
	 * produce any more PCI traffic from/to the affected PE until
	 * recovery is finished.
	 */

	/* Enable IO for the affected PE and collect logs. Usually, the
	 * standard part of PCI config space and the AER registers are
	 * dumped as logs for further analysis.
	 */
	pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/*
	 * Issue PE reset: hot or fundamental reset. Usually, hot reset
	 * is enough. However, the firmware of some PCI adapters would
	 * require fundamental reset.
	 */
	pe_op.op = VFIO_EEH_PE_RESET_HOT;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);
	pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Configure the PCI bridges for the affected PE */
	pe_op.op = VFIO_EEH_PE_CONFIGURE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Restore the state we saved at initialization time.
	 * pci_restore_state() is good enough as an example.
	 */

	/* Hopefully, the error is recovered successfully. Now, you can
	 * resume PCI traffic to/from the affected PE.
	 */

	....

5) There is a v2 of the sPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
   VFIO_IOMMU_DISABLE and implements 2 new ioctls:
   VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
   (which are unsupported in the v1 IOMMU).

   PPC64 paravirtualized guests generate a lot of map/unmap requests,
   and the handling of those includes pinning/unpinning pages and updating
   the mm::locked_vm counter to make sure we do not exceed the rlimit.
   The v2 IOMMU splits accounting and pinning into separate operations:

   - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
     receive a user space address and size of the block to be pinned.
     Bisecting is not supported and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY is
     expected to be called with the exact address and size used for
     registering the memory block. Userspace is not expected to call these
     often. The ranges are stored in a linked list in a VFIO container.

   - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
     IOMMU table and do not do pinning; instead these check that the userspace
     address is from a pre-registered range.

   This separation helps in optimizing DMA for guests.

6) The sPAPR specification allows guests to have one or more additional DMA
   windows on a PCI bus with a variable page size. Two ioctls have been added
   to support this: VFIO_IOMMU_SPAPR_TCE_CREATE and
   VFIO_IOMMU_SPAPR_TCE_REMOVE. The platform has to support the functionality
   or an error will be returned to userspace. The existing hardware supports
   up to 2 DMA windows. One is 2GB long, uses 4K pages and is called the
   "default 32bit window". The other can be as big as the entire RAM and may
   use a different page size; it is optional, and guests create it at
   run-time if the guest driver supports 64bit DMA.

   VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
   a number of TCE table levels (needed when a TCE table is going to be so
   big that the kernel may not be able to allocate enough physically
   contiguous memory). It creates a new window in the available slot and
   returns the bus address where the new window starts. Due to a hardware
   limitation, user space cannot choose the location of DMA windows.

   VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
   and removes it.
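
The DMA window notes above can be sketched with two small helpers: one checks
that a mapping request lies entirely inside a window (the check suggested in
the code flow before VFIO_IOMMU_MAP_DMA), and one counts how many IOMMU pages
a window covers, which gives a feel for how large a TCE table may get. The
struct and function names are hypothetical, and the sketch assumes one TCE
per IOMMU page and a default window starting at bus address 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one DMA window, as userspace might cache it. */
struct demo_dma_window {
	uint64_t start;		/* bus address where the window starts */
	uint64_t size;		/* window size in bytes */
	uint32_t page_shift;	/* log2 of the IOMMU page size */
};

/* True if [iova, iova + len) lies entirely inside the window. */
static bool demo_range_in_window(const struct demo_dma_window *w,
				 uint64_t iova, uint64_t len)
{
	if (len == 0 || iova < w->start)
		return false;

	/* Subtractions cannot wrap once iova >= start is known. */
	if (iova - w->start >= w->size)
		return false;

	return len <= w->size - (iova - w->start);
}

/* Number of IOMMU pages (assumed: one TCE each) covering the window. */
static uint64_t demo_window_tce_count(const struct demo_dma_window *w)
{
	uint64_t page_mask = (1ULL << w->page_shift) - 1;

	/* Round up without computing size + page_size, which could wrap. */
	return (w->size >> w->page_shift) + ((w->size & page_mask) != 0);
}
```

In a real program the window start and size would come from
VFIO_IOMMU_SPAPR_TCE_GET_INFO for the default window, or from the bus address
returned by VFIO_IOMMU_SPAPR_TCE_CREATE for an additional one. For the 2GB
default window with 4K pages, demo_window_tce_count() yields 524288 entries.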

-------------------------------------------------------------------------------

.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
   initial implementation by Tom Lyon while at Cisco.  We've since
   outgrown the acronym, but it's catchy.

.. [2] "safe" also depends upon a device being "well behaved".  It's
   possible for multi-function devices to have backdoors between
   functions and even for single function devices to have alternative
   access to things like PCI config space through MMIO registers.  To
   guard against the former we can include additional precautions in the
   IOMMU driver to group multi-function PCI devices together
   (iommu=group_mf).  The latter we can't prevent, but the IOMMU should
   still provide isolation.  For PCI, SR-IOV Virtual Functions are the
   best indicator of "well behaved", as these are designed for
   virtualization usage models.

.. [3] As always there are trade-offs to virtual machine device
   assignment that are beyond the scope of VFIO.  It's expected that
   future IOMMU technologies will reduce some, but maybe not all, of
   these trade-offs.

.. [4] In this case the device is below a PCI bridge, so transactions
   from either function of the device are indistinguishable to the IOMMU::

	-[0000:00]-+-1e.0-[06]--+-0d.0
				\-0d.1

	00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)

.. [5] Nested translation is an IOMMU feature which supports two stage
   address translations.  This improves the address translation efficiency
   in IOMMU virtualization.

.. [6] PASID stands for Process Address Space ID, introduced by PCI
   Express.  It is a prerequisite for Shared Virtual Addressing (SVA)
   and Scalable I/O Virtualization (Scalable IOV).