Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities.
Some functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
        handle_entry();     // <-- must be 'noinstr' or '__always_inline'
        ...

        instrumentation_begin();
        handle_context();   // <-- instrumentable code
        instrumentation_end();

        ...
        handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented.
A typical syscall handling function invoked from low-level assembly code
looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();
        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);
        instrumentation_end();

        syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc.
After that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.
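
The mirrored ordering between syscalls and KVM can be illustrated with a small
userspace toy model. This is only a sketch and not kernel code: every
``model_*`` name is hypothetical, and each state update is reduced to a single
letter in a log (``L`` for lockdep, ``R`` for RCU / context tracking, ``T`` for
tracing). It assumes nothing beyond the orderings listed in this document.

```c
#include <assert.h>
#include <string.h>

/* Toy userspace model of the ordering rules; this is not kernel code. */
struct model_log {
	char steps[16];	/* one letter per state update, NUL terminated */
	size_t len;
};

static void model_reset(struct model_log *log)
{
	log->len = 0;
	log->steps[0] = '\0';
}

static void model_record(struct model_log *log, char step)
{
	assert(log->len + 1 < sizeof(log->steps));
	log->steps[log->len++] = step;
	log->steps[log->len] = '\0';
}

/* Entering the kernel: lockdep, then RCU / context tracking, then tracing. */
static void model_enter_from_user_mode(struct model_log *log)
{
	model_record(log, 'L');
	model_record(log, 'R');
	model_record(log, 'T');
}

/* Leaving the kernel: the same updates in the reverse order. */
static void model_exit_to_user_mode(struct model_log *log)
{
	model_record(log, 'T');
	model_record(log, 'R');
	model_record(log, 'L');
}

/* A syscall enters the kernel first and leaves it afterwards. */
static const char *model_syscall_round_trip(struct model_log *log)
{
	model_reset(log);
	model_enter_from_user_mode(log);
	model_exit_to_user_mode(log);
	return log->steps;
}

/*
 * From the host's point of view, entering the guest looks like leaving the
 * kernel, and exiting the guest looks like entering the kernel again.
 */
static const char *model_vcpu_round_trip(struct model_log *log)
{
	model_reset(log);
	model_exit_to_user_mode(log);		/* guest enter */
	model_enter_from_user_mode(log);	/* guest exit */
	return log->steps;
}
```

Calling ``model_syscall_round_trip()`` and ``model_vcpu_round_trip()`` on the
same ``struct model_log`` yields logs in which the same two halves appear in
swapped order, which is the sense in which the two transitions share their
state ordering.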

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscalls
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);

        instrumentation_begin();

        irq_enter_rcu();
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();

        instrumentation_end();

        irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within a
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph.
The only downside is that the early entry code up to irq_enter_rcu() must be
aware that the preemption count has not yet been updated with the
HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);
        state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        nmi_handler(regs);
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);

        debug_regs = save_debug_regs();

        if (user_mode(regs)) {
                state = irqentry_enter(regs);

                instrumentation_begin();
                user_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_exit(regs, state);
        } else {
                state = irqentry_nmi_enter(regs);

                instrumentation_begin();
                kernel_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_nmi_exit(regs, state);
        }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled
in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.
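
The ordering and nesting requirements for NMIs can be sketched with a
userspace toy model. This is not kernel code: all ``model_*`` names are
hypothetical, the preemption counter is reduced to a plain nesting depth, and
the lockdep and RCU updates are reduced to log entries that assert the
``in_nmi()``-style condition they rely on.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy userspace model of NMI-style entry/exit; this is not kernel code. */
struct model_cpu {
	unsigned int nmi_depth;	/* stands in for the NMI part of the preemption count */
	char log[64];		/* one letter per state update, NUL terminated */
	size_t len;
};

static bool model_in_nmi(const struct model_cpu *cpu)
{
	return cpu->nmi_depth > 0;
}

static void model_note(struct model_cpu *cpu, char step)
{
	assert(cpu->len + 1 < sizeof(cpu->log));
	cpu->log[cpu->len++] = step;
	cpu->log[cpu->len] = '\0';
}

/* Lockdep and RCU updates rely on model_in_nmi() already being true. */
static void model_nmi_aware_update(struct model_cpu *cpu, char step)
{
	assert(model_in_nmi(cpu));
	model_note(cpu, step);
}

static void model_nmi_enter(struct model_cpu *cpu)
{
	cpu->nmi_depth++;			/* preemption counter: first on entry */
	model_note(cpu, 'P');
	model_nmi_aware_update(cpu, 'L');	/* lockdep */
	model_nmi_aware_update(cpu, 'R');	/* RCU / context tracking */
	model_note(cpu, 'T');			/* tracing */
}

static void model_nmi_exit(struct model_cpu *cpu)
{
	model_note(cpu, 't');			/* tracing */
	model_nmi_aware_update(cpu, 'r');	/* RCU / context tracking */
	model_nmi_aware_update(cpu, 'l');	/* lockdep */
	assert(cpu->nmi_depth > 0);
	cpu->nmi_depth--;			/* preemption counter: last on exit */
	model_note(cpu, 'p');
}

/* An NMI-like exception arrives while an outer NMI is being handled. */
static const char *model_nested_nmi(struct model_cpu *cpu)
{
	memset(cpu, 0, sizeof(*cpu));
	model_nmi_enter(cpu);		/* outer NMI */
	model_nmi_enter(cpu);		/* nested NMI-like exception */
	model_nmi_exit(cpu);
	assert(model_in_nmi(cpu));	/* the outer NMI is still in progress */
	model_nmi_exit(cpu);
	assert(!model_in_nmi(cpu));
	return cpu->log;
}
```

Using a depth counter rather than a boolean flag is what lets the nested exit
leave the outer NMI's state intact in this model. Moving the counter update
after the lockdep or RCU step in ``model_nmi_enter()`` would trip the
assertion in ``model_nmi_aware_update()``.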