.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
the virtualization of this platform, is the plethora of timing devices
available and the complexity of emulating those devices.  In addition,
virtualization of time introduces a new set of challenges because it
introduces a multiplexed division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.
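
As a rough illustration of how the fixed base clock turns into an interrupt
rate, the following sketch computes the 16-bit reload value for a requested
frequency.  The helper names are hypothetical and are not taken from any
kernel source; the only inputs from this document are the 1.193182 MHz base
clock and the 16-bit counter width.  The treatment of a programmed 0 as 65536
is assumed from the 8254 data sheet.

```c
#include <stdint.h>

/* Base clock of the PIT as given in the text: 1.193182 MHz. */
#define PIT_BASE_HZ 1193182u

/*
 * Hypothetical helper: reload value that makes a channel fire at roughly
 * `hz`.  The counters are 16 bits wide and a programmed value of 0 counts
 * as 65536, so the result is clamped to the range 1..65536.
 */
static uint32_t pit_reload_for_hz(uint32_t hz)
{
    uint32_t reload;

    if (hz == 0)
        return 65536;   /* slowest rate the counter can produce */

    reload = (PIT_BASE_HZ + hz / 2) / hz;   /* round to nearest */
    if (reload < 1)
        reload = 1;
    if (reload > 65536)
        reload = 65536;
    return reload;
}

/* Hypothetical helper: output rate in millihertz for a reload value. */
static uint64_t pit_output_millihz(uint32_t reload)
{
    if (reload == 0)
        reload = 65536;
    return (uint64_t)PIT_BASE_HZ * 1000u / reload;
}
```

For example, a 100 Hz tick needs a reload value of 11932, and the largest
reload value of 65536 gives the familiar rate of about 18.2 Hz.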

The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports.  There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.  The gate line is
controlled by port 61h, bit 0, as illustrated in the following diagram::

  --------------             ----------------
  |            |           |                |
  |  1.1932 MHz|---------->| CLOCK      OUT | ---------> IRQ 0
  |    Clock   |   |       |                |
  --------------   |    +->| GATE  TIMER 0  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> 66.3 kHz DRAM
                   |       |                |            (aka /dev/null)
                   |    +->| GATE  TIMER 1  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                           |                |      |
  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
                            ----------------         _|    )--|LPF|---Speaker
                                                    / *----   \___/
  Port 61h, bit 1 ---------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.  When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the output remains high for N/2 counts and low for N/2
 counts; if the count is odd, the output is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.  When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.
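
The high and low phases of the mode 3 square wave follow directly from the
N/2 and (N+1)/2 rule given above.  The following is a minimal sketch of that
arithmetic; the structure and function names are hypothetical and exist only
for illustration.

```c
#include <stdint.h>

/* Hypothetical container for the two phases of one square wave period. */
struct pit_square_wave {
    uint32_t high_counts;   /* input clocks with the output high */
    uint32_t low_counts;    /* input clocks with the output low */
};

/*
 * Split a mode 3 count N into its phases: N/2 and N/2 for even N,
 * (N+1)/2 and (N-1)/2 for odd N.  Integer division covers both cases.
 */
static struct pit_square_wave pit_mode3_phases(uint32_t n)
{
    struct pit_square_wave w;

    w.high_counts = (n + 1) / 2;
    w.low_counts = n / 2;
    return w;
}
```

The two phases always add up to N, so the period of the wave is N input
clocks regardless of whether N is even or odd.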

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
        sample and hold the count to be read in port 0x40;
        additional commands ignored until counter is read;
        mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
        set timer to read LSB only and force MSB to zero;
        mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
        set timer to read MSB only and force LSB to zero;
        mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
        set timer to read / write LSB first, then MSB;
        mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
        Latch combination of counters into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0
        Bit 0 = Unused

  1110 - Latch timer status
        Latch combination of counter mode into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0

        The output of ports 0x40-0x42 following this command will be:

        Bit 7 = Output pin
        Bit 6 = Count loaded (0 if timer has expired)
        Bit 5-4 = Read / Write mode
            01 = LSB only
            10 = MSB only
            11 = LSB / MSB (16-bit)
        Bit 3-1 = Mode
        Bit 0 = Binary (0) / BCD mode (1)
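
The bit layout of the command byte and of the latched status byte can be
sketched as a few pure helpers.  All names below are hypothetical; the code
only packs and unpacks the fields listed in the tables above and performs no
port I/O.  The read / write field of the status byte is returned as a raw
2-bit value rather than interpreted.

```c
#include <stdint.h>

/* Low two bits of the command nibble, per the command table. */
enum pit_access {
    PIT_ACCESS_LATCH   = 0,  /* xx00: latch count */
    PIT_ACCESS_LSB     = 1,  /* xx01: LSB only */
    PIT_ACCESS_MSB     = 2,  /* xx10: MSB only */
    PIT_ACCESS_LSB_MSB = 3   /* xx11: LSB first, then MSB */
};

/* Build a command byte for timer 0..2 (hypothetical helper). */
static uint8_t pit_command(unsigned timer, enum pit_access access,
                           unsigned mode, int bcd)
{
    return (uint8_t)(((timer & 0x3u) << 6) |
                     (((unsigned)access & 0x3u) << 4) |
                     ((mode & 0x7u) << 1) |
                     (bcd ? 1u : 0u));
}

/*
 * Build a 1101 (latch counts) or 1110 (latch status) command.
 * timer_mask bit 0 selects counter 0, bit 1 counter 1, bit 2 counter 2.
 */
static uint8_t pit_readback(int latch_status, unsigned timer_mask)
{
    uint8_t cmd = latch_status ? 0xE0u : 0xD0u;

    return (uint8_t)(cmd | ((timer_mask & 0x7u) << 1));
}

/* Fields of the status byte read back after a 1110 command. */
struct pit_status {
    unsigned output;        /* bit 7: state of the output pin */
    unsigned count_loaded;  /* bit 6: as named in the table above */
    unsigned rw_mode;       /* bits 5-4: raw read / write field */
    unsigned mode;          /* bits 3-1: timer mode */
    unsigned bcd;           /* bit 0: 0 = binary, 1 = BCD */
};

static struct pit_status pit_decode_status(uint8_t s)
{
    struct pit_status st;

    st.output = (s >> 7) & 0x1u;
    st.count_loaded = (s >> 6) & 0x1u;
    st.rw_mode = (s >> 4) & 0x3u;
    st.mode = (s >> 1) & 0x7u;
    st.bcd = s & 0x1u;
    return st;
}
```

With these helpers, programming timer 0 as a 16-bit binary rate generator
yields the command byte 0x34, and timer 2 as a 16-bit binary square wave
yields 0xB6.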

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer or as an additional once-a-day alarm, and can
also be issued after an update of the CMOS registers by the MC146818 is
complete.  The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off.  The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768 kHz crystal, so bits 6-4 of register A should be
programmed to a 32 kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                 0000 = periodic timer disabled
                                 0001 = 3.90625 ms
                                 0010 = 7.8125 ms
                                 0011 = .122070 ms
                                 0100 = .244141 ms
                                     ...
                                 1101 = 125 ms
                                 1110 = 250 ms
                                 1111 = 500 ms
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave interrupt enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables
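
As a hedged sketch of how the register map is consumed, the helpers below
convert BCD time fields and decode register A.  The function names are
hypothetical.  The periodic rate calculation assumes the 32.768 kHz time base
described above and the MC146818 data sheet behaviour in which rate selectors
0001 and 0010 wrap around to 256 Hz and 128 Hz.

```c
#include <stdint.h>

/* Convert one packed BCD byte (e.g. 0x59) to binary (59). */
static unsigned bcd_to_bin(uint8_t bcd)
{
    return (unsigned)(bcd >> 4) * 10u + (bcd & 0x0fu);
}

/* Convert a binary value in the range 0..99 to packed BCD. */
static uint8_t bin_to_bcd(unsigned value)
{
    return (uint8_t)(((value / 10u) << 4) | (value % 10u));
}

/* Register A bit 7: time fields must not be read while this is set. */
static int rtc_update_in_progress(uint8_t reg_a)
{
    return (reg_a >> 7) & 0x1;
}

/*
 * Periodic interrupt rate in Hz from register A bits 3-0, assuming a
 * 32.768 kHz time base.  Returns 0 when the periodic timer is disabled.
 */
static unsigned rtc_periodic_hz(uint8_t reg_a)
{
    unsigned rs = reg_a & 0x0fu;

    if (rs == 0)
        return 0;
    if (rs <= 2)
        rs += 7;    /* selectors 1 and 2 alias 256 Hz and 128 Hz */
    return 32768u >> (rs - 1);
}
```

For instance, a register A value of 0x26 selects the 32 kHz divider and a
periodic rate of 1024 Hz, while 0x2F selects the slowest rate of 2 Hz
(a 500 ms period).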

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.  The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization, requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the LVT (local vector table) timer register, is
capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.
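
The relationship between the bus clock, the divider and the programmed count
can be sketched as follows.  This is an illustration only: the function name
is hypothetical, the bus frequency must be supplied by the caller (it is
normally found by calibration), and the 32-bit count width and power-of-two
dividers from 1 to 128 are assumptions taken from the vendor manuals rather
than from this document.

```c
#include <stdint.h>

/*
 * Hypothetical helper: initial count for a timeout of `period_us`
 * microseconds, given the bus clock in Hz and the configured divider.
 * Returns 0 for an invalid divider and saturates at the 32-bit maximum.
 */
static uint32_t apic_initial_count(uint64_t bus_hz, unsigned divider,
                                   uint64_t period_us)
{
    uint64_t ticks;

    /* Assumed valid dividers: powers of two from 1 to 128. */
    if (divider == 0 || divider > 128 || (divider & (divider - 1)) != 0)
        return 0;

    /* Timer input clock is the bus clock divided down. */
    ticks = (bus_hz / divider) * period_us / 1000000u;
    if (ticks > UINT32_MAX)
        ticks = UINT32_MAX;
    return (uint32_t)ticks;
}
```

With a 100 MHz bus clock and a divider of 16, a 1 ms timeout corresponds to
a count of 6250.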

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000), have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  The TSC has long been
writable, but on old hardware it was generally only possible to write the low
32 bits of the 64-bit counter, and the upper 32 bits of the counter were
cleared.  Now, however, on Intel processors family 0Fh, for models 3, 4 and 6,
and family 06h, models e and f, this restriction has been lifted and all
64 bits are writable.  On AMD systems, the ability to write the TSC MSR is not
an architectural guarantee.

The TSC is always accessible from CPL 0.  Access from CPL > 0 software is
conditional on the CR4.TSD bit which, when set, disables CPL > 0 TSC access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.
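
The per-processor lookup described for RDTSCP can be sketched without touching
the hardware.  In the following illustration the sample structure, the offset
array and the function name are all hypothetical; the offsets would have to
be measured by system software, and the processor indicator is assumed to
have been set up as a plain CPU index.

```c
#include <stddef.h>
#include <stdint.h>

/* One RDTSCP result: the counter plus the processor indicator. */
struct tsc_sample {
    uint64_t tsc;   /* raw TSC value */
    uint32_t cpu;   /* indicator returned with it, assumed a CPU index */
};

/*
 * Apply the offset recorded for the CPU the sample was taken on.
 * Returns 0 on success, -1 if the indicator is out of range.
 */
static int tsc_apply_cpu_offset(struct tsc_sample s, const int64_t *offsets,
                                size_t ncpus, uint64_t *out)
{
    if (s.cpu >= ncpus)
        return -1;

    /* Unsigned addition wraps, so negative offsets work as expected. */
    *out = s.tsc + (uint64_t)offsets[s.cpu];
    return 0;
}
```

Because the counter and the indicator are returned atomically, the offset
that is applied always belongs to the CPU on which the counter was read, even
if the task migrates immediately afterwards.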

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the power-on process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse: if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
a guarantee.  This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states, may be problematic for TSC as well.  The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.
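
One way to picture the catch-up is to compute the value the TSC would have
reached had it kept running, using the time elapsed on the external
clocksource.  The sketch below assumes the TSC frequency is already known in
kHz and that the external clock is read in nanoseconds; both function names
are hypothetical.

```c
#include <stdint.h>

/*
 * Cycles the TSC should advance in `ns` nanoseconds at `tsc_khz`.
 * The multiplication is split to avoid overflowing 64 bits for the
 * sub-millisecond remainder.
 */
static uint64_t tsc_cycles_for_ns(uint64_t ns, uint32_t tsc_khz)
{
    return (ns / 1000000u) * tsc_khz +
           (ns % 1000000u) * tsc_khz / 1000000u;
}

/*
 * Value to write to a stalled TSC: the value saved before idling plus
 * the cycles that correspond to the time measured on the external clock.
 */
static uint64_t tsc_catchup_value(uint64_t tsc_before, uint64_t ref_ns_before,
                                  uint64_t ref_ns_now, uint32_t tsc_khz)
{
    return tsc_before + tsc_cycles_for_ns(ref_ns_now - ref_ns_before, tsc_khz);
}
```

For example, a CPU with a 1 GHz TSC that idled for 2 ms according to the
external clock should have its TSC advanced by 2,000,000 cycles.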

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.
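
Calibration against an external clock amounts to counting TSC cycles over a
window measured by that clock.  The following sketch shows only the
arithmetic; the function name is hypothetical, and the reference could be
any stable counter of known frequency, such as the PIT at 1193182 Hz.

```c
#include <stdint.h>

/*
 * Estimate the TSC frequency in kHz from the number of TSC cycles
 * (`tsc_delta`) observed while a reference counter running at `ref_hz`
 * advanced by `ref_ticks`.  Returns 0 if no reference time elapsed.
 *
 * Assumes tsc_delta * ref_hz fits in 64 bits, which holds for the short
 * calibration windows this sketch is meant for.
 */
static uint32_t tsc_khz_from_ref(uint64_t tsc_delta, uint64_t ref_ticks,
                                 uint32_t ref_hz)
{
    if (ref_ticks == 0)
        return 0;

    /* cycles per second = tsc_delta * ref_hz / ref_ticks; kHz = that / 1000 */
    return (uint32_t)(tsc_delta * ref_hz / (ref_ticks * 1000u));
}
```

If 2,400,000,000 cycles are counted while the PIT advances by 1193182 ticks
(one second), the estimate is 2,400,000 kHz.  The result is only as good as
the assumption that the TSC rate did not change during the window.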
4258c2ecf20Sopenharmony_ci
4268c2ecf20Sopenharmony_ciWhether the TSC runs at a constant rate or scales with the P-state is model
4278c2ecf20Sopenharmony_cidependent and must be determined by inspecting CPUID, chipset or vendor
4288c2ecf20Sopenharmony_cispecific MSR fields.
4298c2ecf20Sopenharmony_ci
4308c2ecf20Sopenharmony_ciIn addition, some vendors have known bugs where the P-state is actually
4318c2ecf20Sopenharmony_cicompensated for properly during normal operation, but when the processor is
4328c2ecf20Sopenharmony_ciinactive, the P-state may be raised temporarily to service cache misses from
4338c2ecf20Sopenharmony_ciother processors.  In such cases, the TSC on halted CPUs could advance faster
4348c2ecf20Sopenharmony_cithan that of non-halted processors.  AMD Turion processors are known to have
4358c2ecf20Sopenharmony_cithis problem.
4368c2ecf20Sopenharmony_ci
4378c2ecf20Sopenharmony_ci3.6. TSC and STPCLK / T-states
4388c2ecf20Sopenharmony_ci------------------------------
4398c2ecf20Sopenharmony_ci
4408c2ecf20Sopenharmony_ciExternal signals given to the processor may also have the effect of stopping
4418c2ecf20Sopenharmony_cithe TSC.  This is typically done for thermal emergency power control to prevent
4428c2ecf20Sopenharmony_cian overheating condition, and typically, there is no way to detect that this
4438c2ecf20Sopenharmony_cicondition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS.  Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.
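
For both VMX and SVM, the offsetting described above amounts to a 64-bit
addition that wraps on overflow.  The following is a minimal sketch of that
arithmetic only; the function names are hypothetical and no VMCS or control
block access is shown.

```c
#include <stdint.h>

/* Value the guest observes when the host TSC is passed through with an offset. */
static uint64_t guest_tsc_read(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;   /* wraps modulo 2^64 */
}

/*
 * Offset to program so that the guest reads 'wanted_guest_tsc' at the
 * moment the host counter reads 'host_tsc'.  A "negative" offset is
 * simply its two's complement representation.
 */
static uint64_t tsc_offset_for(uint64_t host_tsc, uint64_t wanted_guest_tsc)
{
    return wanted_guest_tsc - host_tsc;
}
```

For example, an offset computed with ``tsc_offset_for(host_now, 0)`` makes a
freshly started guest see a TSC that begins counting from zero.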

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture.  Even
if so, the TSCs in multi-socket or NUMA systems may still run independently
despite being locally consistent.

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single node systems.

=========================  =======================================
X86_FEATURE_TSC            The TSC is available in hardware
X86_FEATURE_RDTSCP         The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC   The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC    The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE   TSC sync checks are skipped (VMware)
=========================  =======================================
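
From user space, these bits are visible as lower-case names (``tsc``,
``rdtscp``, ``constant_tsc``, ``nonstop_tsc``, ``tsc_reliable``) on the
``flags`` line of ``/proc/cpuinfo``.  The sketch below shows one way to test a
flags string for a whole-word match, so that ``tsc`` is not mistaken for
``constant_tsc``.  The function name is hypothetical, and reading the file
itself is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if 'flag' appears as a whole space-separated word in 'flags'. */
static bool cpuinfo_has_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return true;
        p += len;   /* keep searching after this partial match */
    }
    return false;
}
```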

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.  This causes problems as the passage of real time, the injection
of machine interrupts and the associated clock sources are no longer completely
synchronized with one another.

This same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems when it is used by the BIOS,
but not in such an extreme fashion.  However, the fact that SMM may cause
similar problems to virtualization makes it a good justification for solving
many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem.  First, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.  Second, if this is not sufficient, it may be necessary
to inject additional interrupts into the guest in order to increase the
effective interrupt rate.  This approach leads to complications in extreme
conditions, where host load or guest lag is too much to compensate for, and
thus a third solution to the problem has arisen: the guest may need to become
aware of lost ticks and compensate for them internally.  Although promising in
theory, the implementation of this policy in Linux has been extremely error
prone, and a number of buggy variants of lost tick compensation are distributed
across commonly used Linux systems.
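
The idea behind guest-side lost tick compensation can be sketched as follows.
This is an illustration only, not the code of any Linux release: the structure,
its fields and the function name are hypothetical, and a reference clock that
keeps running while the guest is descheduled is assumed to exist.

```c
#include <stdint.h>

struct tick_state {
    uint64_t last_ref_ns;   /* reference clock at the last accounted tick */
    uint64_t tick_ns;       /* nominal tick period, e.g. 1000000 for 1000 Hz */
    uint64_t jiffies;       /* ticks accounted so far */
};

/*
 * Called from the (possibly late) timer interrupt.  Accounts one tick
 * per full period elapsed on the reference clock and returns how many
 * of them were "lost", i.e. had no interrupt of their own.
 */
static uint64_t tick_account(struct tick_state *ts, uint64_t now_ref_ns)
{
    uint64_t elapsed = now_ref_ns - ts->last_ref_ns;
    uint64_t ticks = elapsed / ts->tick_ns;

    if (ticks == 0)
        return 0;   /* early or duplicate interrupt: nothing to add */

    ts->jiffies += ticks;
    /* Advance by whole periods only, so sub-tick time is not dropped. */
    ts->last_ref_ns += ticks * ts->tick_ns;
    return ticks - 1;
}
```

The difficulty noted above shows up when this logic is combined with host-side
interrupt reinjection: if both sides compensate for the same missed ticks,
guest time runs fast instead of slow.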

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time.  However, it uses a low
enough rate (ed: is it 18.2 Hz?) that this has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers.  As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.  One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay.  By
definition, the counter, once read, is already old.  However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order.  Such execution is called
non-serialized.  Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR write.

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization.  An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.
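
One possible guard against such backwards reads is to clamp every reading to
the highest value already handed out.  The sketch below uses C11 atomics; the
variable and function names are hypothetical, and the raw value is assumed to
have been obtained from RDTSC by the caller.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Highest timestamp handed out so far, shared by all CPUs. */
static _Atomic uint64_t last_tsc;

/*
 * Clamp a raw reading so that callers never observe a value lower than
 * one that has already been returned on any CPU.
 */
static uint64_t tsc_monotonic(uint64_t raw)
{
    uint64_t prev = atomic_load(&last_tsc);

    while (raw > prev) {
        /* On failure, 'prev' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&last_tsc, &prev, raw))
            return raw;
    }
    return prev;    /* a later (or equal) value was already published */
}
```

The trade-off is a cache line shared by every reader, which costs scalability
in exchange for monotonicity.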

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when TSC results are measured against another time source.  As the TSC has much
higher precision, many possible values of the TSC may be read while another
clock is still expressing the same value.

That is, you may read (T, T+10) while external clock C maintains the same
value.  Due to non-serialized reads, you may actually end up with a range which
fluctuates, for example (T-1 .. T+10).  Thus, any time calculated from a TSC,
but calibrated against an external value, may have a range of valid values.
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.
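
A small worked example makes the backwards step concrete.  It assumes, purely
for illustration, a coarse clock that reads 5000 for every TSC value from 1000
to 1009, and expresses both clocks in the same units; the structure and
function names are hypothetical.

```c
#include <stdint.h>

/* One calibration sample: a TSC value paired with the coarse clock reading. */
struct calib {
    uint64_t tsc_base;
    uint64_t clock_base;    /* same units as the TSC, to keep the math simple */
};

/* Time derived from the TSC relative to a calibration point. */
static uint64_t time_from_tsc(const struct calib *c, uint64_t tsc)
{
    return c->clock_base + (tsc - c->tsc_base);
}
```

With a calibration taken at TSC 1000, a read at TSC 1012 yields 5012.  If the
computation is re-calibrated at TSC 1009, while the coarse clock still reads
5000, the same read at TSC 1012 yields 5003, which is earlier than a value
already reported.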

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec, but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up.  NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based on the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers.  In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual.  A slower clock is less of a problem, as it can
always be caught up to the original rate.  KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.
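
The conversion from a TSC reading to nanoseconds using a stored multiplier and
offset can be sketched as below.  The layout is modelled loosely on a
shift-plus-32.32-fixed-point-multiplier scheme; the structure and field names
here are hypothetical rather than the actual ABI, and ``unsigned __int128`` is
a GCC/Clang extension assumed to be available on the build target.

```c
#include <stdint.h>

/* Per-guest conversion parameters published by the hypervisor. */
struct tsc_scale {
    uint64_t tsc_timestamp;   /* TSC value when the parameters were taken */
    uint64_t system_time_ns;  /* guest time in ns at tsc_timestamp */
    uint32_t mul;             /* multiplier, a fraction scaled by 2^32 */
    int8_t   shift;           /* pre-shift applied to the TSC delta */
};

static uint64_t tsc_to_ns(const struct tsc_scale *s, uint64_t tsc)
{
    uint64_t delta = tsc - s->tsc_timestamp;

    if (s->shift < 0)
        delta >>= -s->shift;
    else
        delta <<= s->shift;

    /* 64x32 multiply with a 128-bit intermediate to avoid overflow. */
    return s->system_time_ns +
           (uint64_t)(((unsigned __int128)delta * s->mul) >> 32);
}
```

After a migration only the parameters need to change: for a 2 GHz TSC,
``mul = 0x80000000`` with ``shift = 0`` gives 0.5 ns per cycle, while a 1 GHz
TSC would use the same multiplier with ``shift = 1``.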

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization.  In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case.  The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.  Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system.  This
becomes a problem if the system is controlling physical hardware, or issues
delays to compensate for slower I/O to and from devices.  The first issue is
not solvable in general for a virtualized system; hardware control software
can't be adequately virtualized without a full real-time operating system,
which would require an RT-aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue.  In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time.  This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel.  Preventing such
problems would require completely isolated virtual time which may not track
real time any longer.  This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.