==================================
VFIO - "Virtual Function I/O" [1]_
==================================

Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment. In other words, this allows
safe [2]_, non-privileged, userspace drivers.

Why do we want that? Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance. From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers [3]_.

Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace. Examples include network adapters (often non-TCP/IP based)
and compute accelerators.
Prior to VFIO, these drivers had to either
go through the full development cycle to become a proper upstream
driver, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.

The VFIO driver framework intends to unify these, replacing the
KVM PCI specific device assignment code and providing a more
secure, more featureful userspace driver environment than UIO.

Groups, Devices, and IOMMUs
---------------------------

Devices are the main target of any I/O driver. Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA. Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment
as allowing a device read-write access to system memory poses the
greatest risk to the overall system integrity.

To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (i.e. solving the addressing problems of devices
with limited address spaces).
With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.

This isolation is not always at the granularity of a single device
though. Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger
multi-function enclosure. While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU. Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU. Topology
can also be a factor in terms of hiding devices. A PCIe-to-PCI
bridge masks the devices behind it, making transactions appear as if
from the bridge itself. Obviously IOMMU design is a major factor
as well.

Therefore, while for the most part an IOMMU may have device level
granularity, any system is susceptible to reduced granularity. The
IOMMU API therefore supports a notion of IOMMU groups. A group is
a set of devices which is isolatable from all other devices in the
system. Groups are therefore the unit of ownership used by VFIO.

While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity. In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables), and to the user (programming only
a single set of translations). For this reason, VFIO makes use of
a container class, which may hold one or more groups. A container
is created by simply opening the /dev/vfio/vfio character device.

On its own, the container provides little functionality, with all
but a couple of version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality. To do this, the user first needs to identify the
group associated with the desired device. This can be done using
the sysfs links described in the example below. By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device). TBD - interface
for disabling driver probing/locking a device.

Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file. If desired and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container. If a group fails to set to a container
with existing groups, a new empty container will need to be used
instead.

With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.

The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.
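
The sysfs lookup mentioned above can also be done from a program. The
following is a minimal sketch and not part of the VFIO API: the helper
names iommu_group_from_link() and pci_iommu_group() are hypothetical, and
the code assumes the iommu_group link target ends in the decimal group
number, as it does in the usage example in this document::

        #include <limits.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        /*
         * Hypothetical helper: parse the group number out of a link target
         * such as "../../../../kernel/iommu_groups/26".  Returns -1 if the
         * last path component is not a non-negative decimal number.
         */
        static int iommu_group_from_link(const char *target)
        {
                const char *base = strrchr(target, '/');
                char *end;
                long group;

                base = base ? base + 1 : target;
                if (*base < '0' || *base > '9')
                        return -1;
                group = strtol(base, &end, 10);
                if (*end != '\0' || group > INT_MAX)
                        return -1;
                return (int)group;
        }

        /*
         * Hypothetical helper: look up the IOMMU group of a PCI device by
         * address, e.g. "0000:06:0d.0".  Returns -1 if the device has no
         * readable iommu_group link.
         */
        static int pci_iommu_group(const char *bdf)
        {
                char path[PATH_MAX], target[PATH_MAX];
                ssize_t len;

                snprintf(path, sizeof(path),
                         "/sys/bus/pci/devices/%s/iommu_group", bdf);
                len = readlink(path, target, sizeof(target) - 1);
                if (len < 0)
                        return -1;
                target[len] = '\0';
                return iommu_group_from_link(target);
        }

The returned number is the $GROUP component of the /dev/vfio/$GROUP node
described above. A return value of -1 most likely means the device does not
exist or is not behind an IOMMU.
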

VFIO Usage Example
------------------

Assume the user wants to access PCI device 0000:06:0d.0::

        $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
        ../../../../kernel/iommu_groups/26

This device is therefore in IOMMU group 26. This device is on the
PCI bus, therefore the user will make use of vfio-pci to manage the
group::

        # modprobe vfio-pci

Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group::

        $ lspci -n -s 0000:06:0d.0
        06:0d.0 0401: 1102:0002 (rev 08)
        # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
        # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

Now we need to look at what other devices are in the group to free
it for use by VFIO::

        $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
        total 0
        lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
                ../../../../devices/pci0000:00/0000:00:1e.0
        lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
                ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
        lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
                ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above. Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).

The final step is to provide the user with access to the group if
unprivileged operation is desired (note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system)::

        # chown user:user /dev/vfio/26

The user now has full access to all the devices and the IOMMU for this
group and can access them as follows::

        /* Needs <fcntl.h>, <stdint.h>, <sys/ioctl.h>, <sys/mman.h>, <linux/vfio.h> */
        int container, group, device, i;
        struct vfio_group_status group_status =
                                        { .argsz = sizeof(group_status) };
        struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
        struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
        struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

        /* Create a new container */
        container = open("/dev/vfio/vfio", O_RDWR);

        if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
                /* Unknown API version */
        }

        if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
                /* Doesn't support the IOMMU driver we want. */
        }

        /* Open the group */
        group = open("/dev/vfio/26", O_RDWR);

        /* Test the group is viable and available */
        ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

        if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
                /* Group is not viable (ie, not all devices bound for vfio) */
        }

        /* Add the group to the container */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

        /* Enable the IOMMU model we want */
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* Get additional IOMMU info */
        ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

        /* Allocate some space and setup a DMA mapping */
        dma_map.vaddr = (__u64)(uintptr_t)mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                                               MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
        dma_map.size = 1024 * 1024;
        dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
        dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

        ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

        /* Get a file descriptor for the device */
        device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

        /* Test and setup the device */
        ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

        for (i = 0; i < device_info.num_regions; i++) {
                struct vfio_region_info reg = { .argsz = sizeof(reg) };

                reg.index = i;

                ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

                /* Setup mappings... read/write offsets, mmaps
                 * For PCI devices, config space is a region */
        }

        for (i = 0; i < device_info.num_irqs; i++) {
                struct vfio_irq_info irq = { .argsz = sizeof(irq) };

                irq.index = i;

                ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

                /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
        }

        /* Gratuitous device reset and go... */
        ioctl(device, VFIO_DEVICE_RESET);

IOMMUFD and vfio_iommu_type1
----------------------------

IOMMUFD is the new user API to manage I/O page tables from userspace.
It intends to be the portal for delivering advanced userspace DMA
features (nested translation [5]_, PASID [6]_, etc.)
while also providing
a backwards compatibility interface for existing VFIO_TYPE1v2_IOMMU use
cases. Eventually the vfio_iommu_type1 driver, as well as the legacy
vfio container and group model, are intended to be deprecated.

The IOMMUFD backwards compatibility interface can be enabled in two ways.
In the first method, the kernel can be configured with
CONFIG_IOMMUFD_VFIO_CONTAINER, in which case the IOMMUFD subsystem
transparently provides the entire infrastructure for the VFIO
container and IOMMU backend interfaces. The compatibility mode can
also be accessed if the VFIO container interface, i.e. /dev/vfio/vfio, is
simply symlinked to /dev/iommu. Note that at the time of writing, the
compatibility mode is not entirely feature complete relative to
VFIO_TYPE1v2_IOMMU (e.g. DMA mapping MMIO) and does not attempt to
provide compatibility to the VFIO_SPAPR_TCE_IOMMU interface. Therefore
it is not generally advisable at this time to switch from native VFIO
implementations to the IOMMUFD compatibility interfaces.

Long term, VFIO users should migrate to device access through the cdev
interface described below, and native access through the IOMMUFD
provided interfaces.

VFIO Device cdev
----------------

Traditionally the user acquires a device fd via VFIO_GROUP_GET_DEVICE_FD
in a VFIO group.

With CONFIG_VFIO_DEVICE_CDEV=y the user can now acquire a device fd
by directly opening a character device /dev/vfio/devices/vfioX where
"X" is the number allocated uniquely by VFIO for registered devices.
The cdev interface does not support noiommu devices, so the user should
use the legacy group interface if noiommu is wanted.

The cdev only works with IOMMUFD. Both VFIO drivers and applications
must adapt to the new cdev security model which requires using
VFIO_DEVICE_BIND_IOMMUFD to claim DMA ownership before starting to
actually use the device. Once BIND succeeds, the VFIO device can
be fully accessed by the user.

VFIO device cdev doesn't rely on VFIO group/container/iommu drivers.
Hence those modules can be fully compiled out in an environment
where no legacy VFIO application exists.

SPAPR does not support IOMMUFD yet, so it cannot support device
cdev either.

VFIO device cdev access is still bound by IOMMU group semantics, i.e. there
can be only one DMA owner for the group. Devices belonging to the same
group cannot be bound to multiple iommufd_ctx or shared between native
kernel and vfio bus driver or other driver supporting the driver_managed_dma
flag. A violation of this ownership requirement will fail at the
VFIO_DEVICE_BIND_IOMMUFD ioctl, which gates full device access.
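
Since "X" is allocated by VFIO, a program usually has to discover the node
name for a given device. The sketch below is one way to do that, under the
assumption that the device exposes a vfio-dev sysfs directory holding a
single vfioX entry, which is the layout shown in the cdev example in this
document. The function name vfio_cdev_path() is hypothetical and not part
of any VFIO interface::

        #include <dirent.h>
        #include <stdio.h>
        #include <string.h>

        /*
         * Hypothetical helper: find the VFIO cdev node for a PCI device by
         * scanning /sys/bus/pci/devices/<bdf>/vfio-dev for a "vfioX" entry.
         * On success "/dev/vfio/devices/vfioX" is written to buf and 0 is
         * returned.  -1 means no cdev was found or buf was too small.
         */
        static int vfio_cdev_path(const char *bdf, char *buf, size_t len)
        {
                char dir[256];
                struct dirent *ent;
                DIR *d;
                int n, ret = -1;

                snprintf(dir, sizeof(dir),
                         "/sys/bus/pci/devices/%s/vfio-dev", bdf);
                d = opendir(dir);
                if (!d)
                        return -1;

                while ((ent = readdir(d)) != NULL) {
                        /* Skip "." and ".." and anything not named vfio* */
                        if (strncmp(ent->d_name, "vfio", 4) != 0)
                                continue;
                        n = snprintf(buf, len, "/dev/vfio/devices/%s",
                                     ent->d_name);
                        if (n >= 0 && (size_t)n < len)
                                ret = 0;
                        break;
                }
                closedir(d);
                return ret;
        }

The resulting path can be passed to open(); as described above, the fd is
only useful for VFIO_DEVICE_BIND_IOMMUFD until the bind succeeds.
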

Device cdev Example
-------------------

Assume the user wants to access PCI device 0000:6a:01.0::

        $ ls /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/
        vfio0

This device is therefore represented as vfio0. The user can verify
its existence::

        $ ls -l /dev/vfio/devices/vfio0
        crw------- 1 root root 511, 0 Feb 16 01:22 /dev/vfio/devices/vfio0
        $ cat /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/vfio0/dev
        511:0
        $ ls -l /dev/char/511\:0
        lrwxrwxrwx 1 root root 21 Feb 16 01:22 /dev/char/511:0 -> ../vfio/devices/vfio0

Then provide the user with access to the device if unprivileged
operation is desired::

        # chown user:user /dev/vfio/devices/vfio0

Finally the user can get the cdev fd by::

        cdev_fd = open("/dev/vfio/devices/vfio0", O_RDWR);

An opened cdev_fd doesn't give the user any permission to access
the device except binding the cdev_fd to an iommufd.
After that point
the device is fully accessible, including attaching it to an
IOMMUFD IOAS/HWPT to enable userspace DMA::

        struct vfio_device_bind_iommufd bind = {
                .argsz = sizeof(bind),
                .flags = 0,
        };
        struct iommu_ioas_alloc alloc_data  = {
                .size = sizeof(alloc_data),
                .flags = 0,
        };
        struct vfio_device_attach_iommufd_pt attach_data = {
                .argsz = sizeof(attach_data),
                .flags = 0,
        };
        struct iommu_ioas_map map = {
                .size = sizeof(map),
                .flags = IOMMU_IOAS_MAP_READABLE |
                         IOMMU_IOAS_MAP_WRITEABLE |
                         IOMMU_IOAS_MAP_FIXED_IOVA,
                .__reserved = 0,
        };

        iommufd = open("/dev/iommu", O_RDWR);

        bind.iommufd = iommufd;
        ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);

        ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
        attach_data.pt_id = alloc_data.out_ioas_id;
        ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);

        /* Allocate some space and setup a DMA mapping */
        map.user_va = (int64_t)mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
        map.iova = 0; /* 1MB starting at 0x0 from device view */
        map.length = 1024 * 1024;
        map.ioas_id = alloc_data.out_ioas_id;

        ioctl(iommufd, IOMMU_IOAS_MAP, &map);

        /* Other device operations as stated in "VFIO Usage Example" */

VFIO User API
-------------------------------------------------------------------------------

Please see include/uapi/linux/vfio.h for complete API documentation.

VFIO bus driver API
-------------------------------------------------------------------------------

VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
into VFIO core. The following interfaces are called when devices are
bound to and unbound from the driver::

        int vfio_register_group_dev(struct vfio_device *device);
        int vfio_register_emulated_iommu_dev(struct vfio_device *device);
        void vfio_unregister_group_dev(struct vfio_device *device);

The driver should embed the vfio_device in its own structure and use
vfio_alloc_device() to allocate the structure, and can register
@init/@release callbacks to manage any private state wrapping the
vfio_device::

        vfio_alloc_device(dev_struct, member, dev, ops);
        void vfio_put_device(struct vfio_device *device);

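
The embedding described above is what lets a driver get from a vfio_device
pointer back to its own state. The following userspace sketch illustrates
only that pattern. It is not kernel code: struct vfio_device_stub stands in
for the real struct vfio_device, struct my_vfio_dev and to_my_dev() are
hypothetical names, and container_of() is re-created here with offsetof().
The embedded member is placed first as a conservative layout choice::

        #include <stddef.h>

        /* Userspace stand-in for the kernel's container_of() macro. */
        #define container_of(ptr, type, member) \
                ((type *)((char *)(ptr) - offsetof(type, member)))

        /* Stand-in for struct vfio_device; the field is illustrative. */
        struct vfio_device_stub {
                int open_count;
        };

        /* Hypothetical driver state wrapping the (stub) vfio_device. */
        struct my_vfio_dev {
                struct vfio_device_stub vdev;   /* embedded core device */
                int irq_count;                  /* driver-private state */
        };

        /* Recover the driver structure from the embedded member. */
        static struct my_vfio_dev *to_my_dev(struct vfio_device_stub *vdev)
        {
                return container_of(vdev, struct my_vfio_dev, vdev);
        }

A callback that receives the embedded pointer can call to_my_dev() to reach
irq_count or any other private field without a separate lookup table.
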
vfio_register_group_dev() indicates to the core to begin tracking the
iommu_group of the specified dev and register the dev as owned by a VFIO bus
driver. Once vfio_register_group_dev() returns it is possible for userspace to
start accessing the driver, thus the driver should ensure it is completely
ready before calling it. The driver provides an ops structure for callbacks
similar to a file operations structure::

        struct vfio_device_ops {
                char    *name;
                int     (*init)(struct vfio_device *vdev);
                void    (*release)(struct vfio_device *vdev);
                int     (*bind_iommufd)(struct vfio_device *vdev,
                                        struct iommufd_ctx *ictx, u32 *out_device_id);
                void    (*unbind_iommufd)(struct vfio_device *vdev);
                int     (*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
                void    (*detach_ioas)(struct vfio_device *vdev);
                int     (*open_device)(struct vfio_device *vdev);
                void    (*close_device)(struct vfio_device *vdev);
                ssize_t (*read)(struct vfio_device *vdev, char __user *buf,
                                size_t count, loff_t *ppos);
                ssize_t (*write)(struct vfio_device *vdev, const char __user *buf,
                                 size_t count, loff_t *size);
                long    (*ioctl)(struct vfio_device *vdev, unsigned int cmd,
                                 unsigned long arg);
                int     (*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
                void    (*request)(struct vfio_device *vdev, unsigned int count);
                int     (*match)(struct vfio_device *vdev, char *buf);
                void    (*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
                int     (*device_feature)(struct vfio_device *device, u32 flags,
                                          void __user *arg, size_t argsz);
        };

Each function is passed the vdev that was originally registered
in the vfio_register_group_dev() or vfio_register_emulated_iommu_dev()
call above. This allows the bus driver to obtain its private data using
container_of().

::

        - The init/release callbacks are issued when vfio_device is initialized
          and released.

        - The open/close device callbacks are issued when the first
          instance of a file descriptor for the device is created (eg.
          via VFIO_GROUP_GET_DEVICE_FD) for a user session.

        - The ioctl callback provides a direct pass through for some VFIO_DEVICE_*
          ioctls.

        - The [un]bind_iommufd callbacks are issued when the device is bound to
          and unbound from iommufd.

        - The [de]attach_ioas callback is issued when the device is attached to
          and detached from an IOAS managed by the bound iommufd. However, the
          attached IOAS can also be automatically detached when the device is
          unbound from iommufd.

        - The read/write/mmap callbacks implement the device region access defined
          by the device's own VFIO_DEVICE_GET_REGION_INFO ioctl.

        - The request callback is issued when the device is going to be
          unregistered, such as when trying to unbind the device from the vfio
          bus driver.

        - The dma_unmap callback is issued when a range of iovas is unmapped
          in the container or IOAS attached by the device. Drivers which make
          use of the vfio page pinning interface must implement this callback in
          order to unpin pages within the dma_unmap range. Drivers must tolerate
          this callback even before calls to open_device().

PPC64 sPAPR implementation note
-------------------------------

This implementation has some specifics:

1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   container is supported as an IOMMU table is allocated at boot time,
   one table per IOMMU group which is a Partitionable Endpoint (PE)
   (PE is often a PCI domain but not always).

   Newer systems (POWER8 with IODA2) have an improved hardware design which
   removes this limitation and allows multiple IOMMU groups per VFIO
   container.

2) The hardware supports so-called DMA windows: the PCI address range
   within which DMA transfers are allowed. Any attempt to access address
   space outside the window leads to isolation of the whole PE.

3) PPC64 guests are paravirtualized but not fully emulated. There is an API
   to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
   currently there is no way to reduce the number of calls. In order to make
   things faster, the map/unmap handling has been implemented in real mode,
   which provides excellent performance but has limitations such as the
   inability to do locked pages accounting in real time.

4) According to the sPAPR specification, a Partitionable Endpoint (PE) is an
   I/O subtree that can be treated as a unit for the purposes of partitioning
   and error recovery. A PE may be a single or multi-function IOA (IO Adapter),
   a function of a multi-function IOA, or multiple IOAs (possibly including
   switch and bridge structures above the multiple IOAs). PPC64 guests detect
   PCI errors and recover from them via EEH RTAS services, which work on the
   basis of additional ioctl commands.

   So 4 additional ioctls have been added:

        VFIO_IOMMU_SPAPR_TCE_GET_INFO
                returns the size and the start of the DMA window on the PCI bus.

        VFIO_IOMMU_ENABLE
                enables the container. The locked pages accounting
                is done at this point. This lets the user first learn what
                the DMA window is and adjust the rlimit before doing any
                real work.

        VFIO_IOMMU_DISABLE
                disables the container.

        VFIO_EEH_PE_OP
                provides an API for EEH setup, error detection and recovery.

   The code flow from the example above should be slightly changed::

        struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };

        .....
        /* Add the group to the container */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

        /* Enable the IOMMU model we want */
        ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);

        /* Get additional sPAPR IOMMU info */
        struct vfio_iommu_spapr_tce_info spapr_iommu_info = {
                .argsz = sizeof(spapr_iommu_info)
        };
        ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);

        if (ioctl(container, VFIO_IOMMU_ENABLE)) {
                /* Cannot enable container, may be low rlimit */
        }

        /* Allocate some space and setup a DMA mapping */
        dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

        dma_map.size = 1024 * 1024;
        dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
        dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

        /* Check here if .iova/.size are within the DMA window from spapr_iommu_info */
        ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

        /* Get a file descriptor for the device */
        device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

        ....

        /* Gratuitous device reset and go... */
        ioctl(device, VFIO_DEVICE_RESET);

        /* Make sure EEH is supported */
        ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);

        /* Enable the EEH functionality on the device */
        pe_op.op = VFIO_EEH_PE_ENABLE;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        /* It is suggested that you create an additional data structure to
         * represent the PE, and add the child devices belonging to the same
         * IOMMU group to the PE instance for later reference.
         */

        /* Check the PE's state and make sure it's in functional state */
        pe_op.op = VFIO_EEH_PE_GET_STATE;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        /* Save device state using pci_save_state().
         * EEH should be enabled on the specified device.
         */

        ....

        /* Inject EEH error, which is expected to be caused by a 32-bit
         * config load.
         */
        pe_op.op = VFIO_EEH_PE_INJECT_ERR;
        pe_op.err.type = EEH_ERR_TYPE_32;
        pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR;
        pe_op.err.addr = 0ul;
        pe_op.err.mask = 0ul;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        ....

        /* When 0xFF's are returned from reading PCI config space or IO BARs
         * of the PCI device, check the PE's state to see if it has been
         * frozen.
         */
        pe_op.op = VFIO_EEH_PE_GET_STATE;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        /* Wait for pending PCI transactions to complete and do not
         * produce any more PCI traffic from/to the affected PE until
         * recovery is finished.
         */

        /* Enable IO for the affected PE and collect logs. Usually, the
         * standard part of PCI config space and the AER registers are
         * dumped as logs for further analysis.
         */
        pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        /*
         * Issue PE reset: hot or fundamental reset. Usually, hot reset
         * is enough.
         * However, the firmware of some PCI adapters may require a
         * fundamental reset.
         */
        pe_op.op = VFIO_EEH_PE_RESET_HOT;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);
        pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        /* Configure the PCI bridges for the affected PE */
        pe_op.op = VFIO_EEH_PE_CONFIGURE;
        ioctl(container, VFIO_EEH_PE_OP, &pe_op);

        /* Restore the state we saved at initialization time.
         * pci_restore_state() is good enough as an example.
         */

        /* Hopefully, the error has been recovered successfully. You can
         * now resume PCI traffic to/from the affected PE.
         */

        ....

5) There is a v2 of the SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
   VFIO_IOMMU_DISABLE and implements 2 new ioctls:
   VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
   (which are unsupported in v1 IOMMU).

   PPC64 paravirtualized guests generate a lot of map/unmap requests,
   and the handling of those includes pinning/unpinning pages and updating
   the mm::locked_vm counter to make sure we do not exceed the rlimit.
   The v2 IOMMU splits accounting and pinning into separate operations:

   - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
     receive a user space address and size of the block to be pinned.
     Bisecting is not supported and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY is
     expected to be called with the exact address and size used for registering
     the memory block. The userspace is not expected to call these often.
     The ranges are stored in a linked list in a VFIO container.

   - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
     IOMMU table and do not do pinning; instead these check that the userspace
     address is from a pre-registered range.

   This separation helps in optimizing DMA for guests.

6) The sPAPR specification allows guests to have additional DMA window(s) on
   a PCI bus with a variable page size. Two ioctls have been added to support
   this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
   The platform has to support the functionality or an error will be returned
   to the userspace. The existing hardware supports up to 2 DMA windows: one is
   2GB long, uses 4K pages and is called the "default 32bit window"; the other
   can be as big as the entire RAM and use a different page size. The second
   window is optional, and guests create it at run time if the guest driver
   supports 64bit DMA.

   VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
   a number of TCE table levels (needed when the TCE table is going to be so
   big that the kernel may not be able to allocate enough physically
   contiguous memory). It creates a new window in the available slot and
   returns the bus address where the new window starts. Due to a hardware
   limitation, the user space cannot choose the location of DMA windows.

   VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
   and removes it.

-------------------------------------------------------------------------------

.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
   initial implementation by Tom Lyon while at Cisco. We've since
   outgrown the acronym, but it's catchy.

.. [2] "safe" also depends upon a device being "well behaved". It's
   possible for multi-function devices to have backdoors between
   functions and even for single function devices to have alternative
   access to things like PCI config space through MMIO registers. To
   guard against the former we can include additional precautions in the
   IOMMU driver to group multi-function PCI devices together
   (iommu=group_mf). The latter we can't prevent, but the IOMMU should
   still provide isolation.
   For PCI, SR-IOV Virtual Functions are the
   best indicator of "well behaved", as these are designed for
   virtualization usage models.

.. [3] As always there are trade-offs to virtual machine device
   assignment that are beyond the scope of VFIO. It's expected that
   future IOMMU technologies will reduce some, but maybe not all, of
   these trade-offs.

.. [4] In this case the device is below a PCI bridge, so transactions
   from either function of the device are indistinguishable to the iommu::

        -[0000:00]-+-1e.0-[06]--+-0d.0
                                \-0d.1

        00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)

.. [5] Nested translation is an IOMMU feature which supports two stage
   address translations. This improves the address translation efficiency
   in IOMMU virtualization.

.. [6] PASID stands for Process Address Space ID, introduced by PCI
   Express. It is a prerequisite for Shared Virtual Addressing (SVA)
   and Scalable I/O Virtualization (Scalable IOV).
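
The sPAPR example above leaves the check that .iova/.size fall within the
DMA window as a comment. A minimal sketch of such a check follows. The
helper name dma_range_in_window() is hypothetical and not part of the VFIO
API, and the window is passed as plain integers (for instance the
dma32_window_start and dma32_window_size values reported by
VFIO_IOMMU_SPAPR_TCE_GET_INFO) so that the sketch does not depend on
<linux/vfio.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the VFIO API): return true if the range
 * [iova, iova + size) lies entirely inside the DMA window
 * [win_start, win_start + win_size).
 */
static bool dma_range_in_window(uint64_t win_start, uint64_t win_size,
                                uint64_t iova, uint64_t size)
{
        /* An empty mapping, or one starting below the window, never fits. */
        if (size == 0 || iova < win_start)
                return false;

        /* Offset of the mapping from the window start; cannot underflow. */
        uint64_t offset = iova - win_start;

        /* Two comparisons instead of offset + size, which could overflow. */
        return offset <= win_size && size <= win_size - offset;
}
```

In the example flow this would be called with the window start and size
from spapr_iommu_info together with dma_map.iova and dma_map.size, before
issuing VFIO_IOMMU_MAP_DMA.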