Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
	...

	instrumentation_begin();
	handle_context();   // <-- instrumentable code
	instrumentation_end();

	...
	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.
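
The rule that objtool checks statically can be illustrated with a small
user-space model that checks it at run time instead. This is only a sketch:
every ``model_*`` name below is hypothetical and none of it is kernel code. It
tracks whether execution is currently inside an
instrumentation_begin()/instrumentation_end() range and counts calls to
instrumentable code made outside such a range.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
static int model_instr_depth;	/* > 0 while inside an instrumentable range */
static int model_violations;	/* instrumentable calls made outside a range */

static void model_instrumentation_begin(void)
{
	model_instr_depth++;
}

static void model_instrumentation_end(void)
{
	model_instr_depth--;
}

/* Stand-in for any traceable function, e.g. handle_context() above. */
static void model_instrumentable(void)
{
	if (model_instr_depth == 0)
		model_violations++;	/* called from a noinstr region */
}

/* Mirrors the shape of entry() above; 'buggy' injects a violation. */
static void model_entry(bool buggy)
{
	if (buggy)
		model_instrumentable();	/* outside begin/end: not allowed */

	model_instrumentation_begin();
	model_instrumentable();		/* inside begin/end: allowed */
	model_instrumentation_end();
}
```

Calling ``model_entry(false)`` leaves the violation counter untouched, while
``model_entry(true)`` increments it once. The real check happens at build time
and needs no such counter.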

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
	arch_syscall_enter(regs);
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if (!invoke_syscall(regs, nr) && nr != -1)
		result_reg(regs) = __sys_ni_syscall(regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep
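
The two orderings above can be sketched as a user-space model that records
each step in a log. The ``model_*`` names and the log format are assumptions
made for this illustration. The real functions update kernel state rather than
a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space model; only the ordering mirrors the text above. */
static char model_log[128];

/* Append one step name, followed by a space, to the log. */
static void model_record(const char *step)
{
	size_t used = strlen(model_log);

	snprintf(model_log + used, sizeof(model_log) - used, "%s ", step);
}

static void model_enter_from_user_mode(void)
{
	model_record("lockdep+");
	model_record("rcu+");
	model_record("tracing+");
}

static void model_exit_to_user_mode(void)
{
	/* Reverse order of model_enter_from_user_mode(). */
	model_record("tracing-");
	model_record("rcu-");
	model_record("lockdep-");
}

static void model_syscall(void)
{
	model_log[0] = '\0';
	model_enter_from_user_mode();	/* must come first on entry */
	model_record("work");		/* entry work, syscall, exit work */
	model_exit_to_user_mode();	/* must come last on exit */
}
```

After ``model_syscall()`` returns, the log reads as the entry steps, the work
and then the exit steps mirrored around it.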

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscalls
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.
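
The kernel-space case can be sketched in user space as follows. This is a
simplified model under stated assumptions: the ``model_*`` names, the
``exit_rcu`` field and the two global flags are invented for the illustration,
and lockdep and tracing are left out. It shows only why entry has to hand a
state value to exit: exit must undo the RCU update if and only if entry
performed it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; names and fields are illustrative only. */
struct model_irqentry_state {
	bool exit_rcu;	/* true if entry made RCU start watching */
};

static bool model_rcu_watching;	/* is RCU watching this CPU? */
static bool model_is_idle_task;	/* interrupted context is the idle task */

/* Interrupt raised while executing in kernel space. */
static struct model_irqentry_state model_irqentry_enter(void)
{
	struct model_irqentry_state state = { .exit_rcu = false };

	if (model_is_idle_task && !model_rcu_watching) {
		model_rcu_watching = true;
		state.exit_rcu = true;
	}
	/* Otherwise RCU is already watching and nothing is recorded. */
	return state;
}

static void model_irqentry_exit(struct model_irqentry_state state)
{
	/* Undo the RCU update only if this entry performed it. */
	if (state.exit_rcu)
		model_rcu_watching = false;
}
```

With the idle task interrupted, the pair toggles ``model_rcu_watching`` on and
off again. With any other task interrupted, the flag is left alone.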

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
	irqentry_state_t state;

	arch_interrupt_enter(regs);
	state = irqentry_enter(regs);

	instrumentation_begin();

	irq_enter_rcu();
	invoke_irq_handler(regs, nr);
	irq_exit_rcu();

	instrumentation_end();

	irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
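
The effect of HARDIRQ_OFFSET on in_hardirq() can be sketched with a
user-space model of the preemption count. The bit positions below are
illustrative assumptions and should not be read as the definitive layout, and
all ``model_*`` names are hypothetical. The sketch shows that in_hardirq() is
false before irq_enter_rcu(), true inside the pair, and false again by the
time soft interrupts would be handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout for illustration: a 4-bit hardirq field at bit 16. */
#define MODEL_HARDIRQ_SHIFT	16
#define MODEL_HARDIRQ_OFFSET	(1u << MODEL_HARDIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK	(0xfu << MODEL_HARDIRQ_SHIFT)

/* Hypothetical user-space model; none of these names exist in the kernel. */
static unsigned int model_preempt_count;
static bool model_softirq_saw_hardirq;	/* what softirq handling observed */

static bool model_in_hardirq(void)
{
	return (model_preempt_count & MODEL_HARDIRQ_MASK) != 0;
}

/* Scheduling is only sane once the count has dropped back to zero. */
static bool model_may_schedule(void)
{
	return model_preempt_count == 0;
}

static void model_irq_enter_rcu(void)
{
	model_preempt_count += MODEL_HARDIRQ_OFFSET;
}

static void model_irq_exit_rcu(void)
{
	/* Remove the offset first ... */
	model_preempt_count -= MODEL_HARDIRQ_OFFSET;
	/* ... so that softirq handling does not run in hardirq context. */
	model_softirq_saw_hardirq = model_in_hardirq();
}
```

After ``model_irq_exit_rcu()`` the model reports that soft interrupt handling
did not see hardirq context and that scheduling from the exit path would be
permitted.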

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
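
The reason for this ordering can be illustrated with a user-space model in
which the lockdep and RCU steps record what in_nmi() returned when they ran.
The bit position of the NMI field and all ``model_*`` names are assumptions
made for this sketch, and tracing is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout for illustration: a 4-bit NMI field at bit 20. */
#define MODEL_NMI_OFFSET	(1u << 20)
#define MODEL_NMI_MASK		(0xfu << 20)

/* Hypothetical user-space model; none of these names exist in the kernel. */
static unsigned int model_preempt_count;
static bool model_lockdep_saw_nmi;	/* in_nmi() as seen by the lockdep step */
static bool model_rcu_saw_nmi;		/* in_nmi() as seen by the RCU step */

static bool model_in_nmi(void)
{
	return (model_preempt_count & MODEL_NMI_MASK) != 0;
}

static void model_lockdep_update(void)
{
	model_lockdep_saw_nmi = model_in_nmi();
}

static void model_rcu_update(void)
{
	model_rcu_saw_nmi = model_in_nmi();
}

static void model_irqentry_nmi_enter(void)
{
	model_preempt_count += MODEL_NMI_OFFSET;	/* first on entry */
	model_lockdep_update();
	model_rcu_update();
}

static void model_irqentry_nmi_exit(void)
{
	model_rcu_update();
	model_lockdep_update();
	model_preempt_count -= MODEL_NMI_OFFSET;	/* last on exit */
}
```

Because the counter is updated first on entry and last on exit, both steps
observe ``model_in_nmi()`` as true in either direction. Swapping the counter
update with either step would make that step observe false.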

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
	irqentry_state_t state;

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
	irqentry_state_t state;

	arch_nmi_enter(regs);

	debug_regs = save_debug_regs();

	if (user_mode(regs)) {
		state = irqentry_enter(regs);

		instrumentation_begin();
		user_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_exit(regs, state);
	} else {
		state = irqentry_nmi_enter(regs);

		instrumentation_begin();
		kernel_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_nmi_exit(regs, state);
	}
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while an NMI is being handled. NMI entry code therefore has to be
reentrant, and state updates need to handle nesting.
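
One way to picture nesting-safe state updates is a user-space model in which
every entry saves what it found and every exit restores exactly that. This is
a sketch of the general idea only: the ``model_*`` names and the saved field
are hypothetical, and how the kernel actually tracks nesting is not described
in this document.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; names and fields are illustrative only. */
struct model_nmi_state {
	bool rcu_was_watching;	/* what this level found on entry */
};

static unsigned int model_nmi_depth;	/* NMI-like handlers in flight */
static bool model_rcu_watching;

static struct model_nmi_state model_nmi_enter(void)
{
	struct model_nmi_state state = {
		.rcu_was_watching = model_rcu_watching,
	};

	model_nmi_depth++;
	model_rcu_watching = true;
	return state;
}

static void model_nmi_exit(struct model_nmi_state state)
{
	/* Restore what this level found, not a fixed value. */
	model_rcu_watching = state.rcu_was_watching;
	model_nmi_depth--;
}
```

When an inner entry/exit pair runs inside an outer one, the inner exit leaves
``model_rcu_watching`` set because that is what it found, and only the outer
exit clears it.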