.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat. All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
the virtualization of this platform, is the plethora of timing devices
available and the complexity of emulating those devices. In addition,
virtualization of time introduces a new set of challenges because it
introduces a multiplexed division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available. TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters. Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker. Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.

The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports. There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.
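
As a rough illustration of how the 16-bit counters relate to the base clock,
the sketch below computes a reload value for a desired interrupt rate and
splits it into the two bytes written when a counter is accessed LSB first,
then MSB. The function names are hypothetical and chosen for this example
only. The treatment of a written value of 0 as 65536 counts is a property of
the i8254 in binary counting mode that this document does not otherwise cover.

```python
PIT_BASE_HZ = 1_193_182  # fixed PIT base clock, as given in the text above


def pit_reload_value(target_hz):
    """Return the reload value whose period is closest to 1/target_hz."""
    count = round(PIT_BASE_HZ / target_hz)
    # Clamp to 1..65536; 65536 is the longest period a 16-bit counter allows.
    return max(1, min(count, 65536))


def split_lsb_msb(count):
    """Split a reload value into the (LSB, MSB) bytes written to the port."""
    value = count & 0xFFFF  # 65536 wraps to 0, which the PIT counts as 65536
    return value & 0xFF, value >> 8


count = pit_reload_value(1000)          # reload value for roughly 1000 Hz
actual_hz = PIT_BASE_HZ / count         # the rate the hardware would produce
lsb, msb = split_lsb_msb(count)
```

Because the reload value is an integer, the rate actually produced usually
differs slightly from the one requested, which is one reason tick counting
alone is a poor basis for long-term timekeeping.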

The gate line is controlled by port 61h, bit 0, as illustrated in the
following diagram::

     --------------             ----------------
    |              |           |                |
    |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
    |    Clock     |   |       |                |
     --------------    |    +->| GATE  TIMER 0  |
                       |        ----------------
                       |
                       |        ----------------
                       |       |                |
                       |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                       |       |                |            (aka /dev/null)
                       |    +->| GATE  TIMER 1  |
                       |        ----------------
                       |
                       |        ----------------
                       |       |                |
                       |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                               |                |      |
    Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____
                                ----------------          _|    )--|LPF|---Speaker
                                                         / *----   \___/
    Port 61h, bit 1 ------------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1). When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high. When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.
 When the count reaches zero, the output goes high.

Mode 2: Rate Generator.
 The output is initially set high. When the countdown
 reaches 1, the output goes low for one count and then returns high. The value
 is reloaded and the countdown automatically resumes. If the gate line goes
 low, the count is halted. If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave. The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached. The count only proceeds when gate is high and is
 automatically reloaded on reaching zero. The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the clock remains high for N/2 counts and low for N/2
 counts; if the count is odd, the clock is high for (N+1)/2 counts and low
 for (N-1)/2 counts. Only even values are latched by the counter, so odd
 values are not observed when reading. This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.
 Then the output goes low for 1 clock cycle and returns high. The counter is
 not reloaded. Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high. When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered). When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high. The counter is
 not reloaded.

In addition to normal binary counting, the PIT supports BCD counting. The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
        sample and hold the count to be read in port 0x40;
        additional commands ignored until counter is read;
        mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
        set timer to read LSB only and force MSB to zero;
        mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
        set timer to read MSB only and force LSB to zero;
        mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
        set timer to read / write LSB first, then MSB;
        mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
        Latch combination of counters into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0
        Bit 0 = Unused

  1110 - Latch timer status
        Latch combination of counter mode into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0

        The output of ports 0x40-0x42 following this command will be:

        Bit 7 = Output pin
        Bit 6 = Count loaded (0 if timer has expired)
        Bit 5-4 = Read / Write mode
            01 = LSB only
            10 = MSB only
            11 = LSB / MSB (16-bit)
        Bit 3-1 = Mode
        Bit 0 = Binary (0) / BCD mode (1)

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock. The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read. Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
can function as a periodic timer, an additional once a day alarm, and can issue
interrupts after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off. The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
programmed to a 32kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                  0000 = periodic timer disabled
                                  0001 = 3.90625 mS
                                  0010 = 7.8125 mS
                                  0011 = .122070 mS
                                  0100 = .244141 mS
                                     ...
                                  1101 = 125 mS
                                  1110 = 250 mS
                                  1111 = 500 mS
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave interrupt enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.
The APIC is accessed through memory-mapped registers and provides interrupt
service to each CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
the use of the APIC and that workarounds may be required. In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here. It should be pointed out that the APIC
timer is programmed through the timer entry of the LVT (local vector table),
is capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC. It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.
Some systems designated as legacy free may support only the HPET as a hardware
timer device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more. It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system). The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers. To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.
Such a timer device would require additional support to be virtualized properly
and is not considered important at this time as no known operating system does
this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time. In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
limitations restricted writes to the TSC: on old hardware it was generally
only possible to write the low 32 bits of the 64-bit counter, and the upper
32 bits of the counter were cleared. Now, however, on Intel processors family
0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
has been lifted and all 64 bits are writable. On AMD systems, the ability to
write the TSC MSR is not an architectural guarantee.

The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
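
The write and access rules above can be modelled in a few lines. This is a toy
sketch rather than kernel code: ``write_tsc`` and ``rdtsc_permitted`` are
hypothetical helper names, and the ``full_64bit_write`` flag simply stands in
for the processor family and model checks mentioned in the text.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def write_tsc(value, full_64bit_write):
    """Return the TSC contents after a write of 'value'.

    On hardware with the old restriction only the low 32 bits are taken
    and the upper 32 bits end up cleared.
    """
    value &= MASK64
    if full_64bit_write:
        return value
    return value & MASK32


def rdtsc_permitted(cpl, cr4_tsd):
    """CPL 0 may always read the TSC; CPL > 0 only while CR4.TSD is clear."""
    return cpl == 0 or not cr4_tsd


restored = write_tsc(0x0000_0123_4567_89AB, full_64bit_write=False)
```

In this model ``restored`` keeps only ``0x456789AB``, which shows why a TSC
value saved earlier cannot always be written back unchanged on older hardware.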

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number. This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant. Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations. This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on. Generally, CPUs on the same die will share
the same clock, however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the power-on process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64-bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs. This may be done by BIOS or system software, but in
practice, getting a perfectly synchronized TSC will not be possible unless all
values are read from the same clock, which generally only is possible on single
socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
a guarantee. This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs to drift over time. Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores. This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states may be problematic for TSC as well. The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed. Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency. They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values. In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors. In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors. AMD Turion processors are known to have
this problem.

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC.
This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner. In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS. Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner. In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture. Even
if so, the TSCs in multi-socket or NUMA systems may still run independently
despite being locally consistent.
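
The offset mechanism described in sections 3.7 and 3.8 amounts to a 64-bit
wrapping addition of the offset field to the host TSC. The sketch below models
it under that assumption. ``guest_tsc`` and ``offset_for_guest_value`` are
hypothetical names used only here, and the second helper shows one way a
hypervisor might pick an offset so that the guest observes a chosen TSC value
at a given host TSC reading.

```python
MASK64 = (1 << 64) - 1


def guest_tsc(host_tsc, tsc_offset):
    """Guest-visible TSC: host TSC plus offset, wrapped to 64 bits."""
    return (host_tsc + tsc_offset) & MASK64


def offset_for_guest_value(host_tsc, wanted_guest_tsc):
    """Offset that makes the guest read 'wanted_guest_tsc' at 'host_tsc'."""
    return (wanted_guest_tsc - host_tsc) & MASK64


# Start a guest whose TSC appears to begin at zero while the host TSC
# already reads 1,000,000.
offset = offset_for_guest_value(1_000_000, 0)
later = guest_tsc(1_000_500, offset)   # 500 host cycles after guest start
```

A constant offset preserves the host TSC rate, so it cannot hide any of the
rate or synchronization problems described in the preceding sections; it only
shifts the value the guest observes.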

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single node systems.

========================= =======================================
X86_FEATURE_TSC           The TSC is available in hardware
X86_FEATURE_RDTSCP        The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC  The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC   The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE  TSC sync checks are skipped (VMware)
========================= =======================================

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise. The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines. Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption. It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.
This causes problems because the guest's perception of the passage of time,
the injection of machine interrupts and the associated clock sources are no
longer completely synchronized with real time.

This same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems when it is used by the BIOS,
but not in such an extreme fashion. However, the fact that SMM may cause
similar problems to virtualization makes it a good justification for solving
many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts. These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind. This is especially problematic if a high interrupt rate
is selected, such as 1000 HZ, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem. First, it may be possible
to simply ignore it.
Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time. If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate. This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally. Although promising in theory, the
implementation of this policy in Linux has been extremely error-prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time. However, it uses a low
enough rate (ed: is it 18.2 Hz?) that this has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers. As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.
One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay. By
definition, the counter, once read, is already old. However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order. Such execution is called
non-serialized. Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization. An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when results of the TSC are measured against another time source. As the TSC
has much higher precision, many possible values of the TSC may be read while
another clock is still expressing the same value.
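
As a rough illustration of this aliasing, the following Python sketch models a
TSC calibrated against a coarse clock. Everything in it is an assumption made
for the example: the 1 GHz TSC rate, the 1 ms clock step, the helper names, and
the clamp used as a guard. It is a toy model, not the algorithm used by
kvmclock or by the kernel.

```python
"""Toy model of timespec aliasing (all constants are illustrative assumptions)."""

NSEC_PER_SEC = 1_000_000_000
TSC_HZ = 1_000_000_000      # assumed 1 GHz TSC: one cycle per nanosecond
TICK_NS = 1_000_000         # assumed coarse clock step of 1 ms


def read_tsc(true_ns):
    """Idealized TSC: cycles elapsed since true time zero."""
    return true_ns * TSC_HZ // NSEC_PER_SEC


def read_coarse_clock(true_ns):
    """Coarse clock C: only advances in whole TICK_NS steps."""
    return true_ns // TICK_NS * TICK_NS


def calibrate(true_ns):
    """Sample both clocks at one instant; returns (tsc_base, clock_base)."""
    return read_tsc(true_ns), read_coarse_clock(true_ns)


def derived_time_ns(calibration, tsc):
    """Time computed from a TSC reading and a calibration pair."""
    tsc_base, clock_base = calibration
    return clock_base + (tsc - tsc_base) * NSEC_PER_SEC // TSC_HZ


def clamp_monotonic(last_ns, new_ns):
    """One possible guard (hypothetical): never report an earlier time."""
    return max(last_ns, new_ns)


def demo():
    # Calibrate just after C ticked to 1 ms: C lags true time by only 100 ns.
    old = calibrate(1_000_100)
    before = derived_time_ns(old, read_tsc(1_999_900))
    # Recalibrate just before the next tick: C still reads 1 ms, so it now
    # lags true time by almost a full tick.
    new = calibrate(1_999_900)
    after = derived_time_ns(new, read_tsc(1_999_950))
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(before, after, clamp_monotonic(before, after))
```

In this model both TSC reads are exact, yet the time derived after the second
calibration is almost one clock step earlier than the time derived before it.
The clamp hides the backwards step from callers, at the cost of reported time
standing still until the new calibration catches up.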

That is, you may read any TSC value in the range (T, T+10) while external
clock C maintains the same value. Due to non-serialized reads, the observed
range may actually fluctuate, for example (T-1, T+10). Thus, any time
calculated from a TSC, but calibrated against an external value, may have a
range of valid values. Re-calibrating this computation may actually cause
time, as computed after the calibration, to go backwards, compared with time
computed before the calibration.

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec, but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly, in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which the guest time may need to be caught up.
NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based on the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers. In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual. A slower clock is less of a problem, as it can
always be caught up to the original rate. KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC, which the guest uses to
convert TSC readings into nanosecond-resolution values.

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization. In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case. The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time. Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system. This
becomes a problem if the system is controlling physical hardware, or issues
delays to compensate for slower I/O to and from devices. The first issue is
not solvable in general for a virtualized system; hardware control software
can't be adequately virtualized without a full real-time operating system,
which would require an RT-aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue. In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time. This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel. Preventing such
problems would require completely isolated virtual time which may not track
real time any longer. This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.