.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters.  Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable or disable the syscall redirection.  Syscalls are
then either executed directly (redirection disabled) or sent to be
emulated in userspace through a SIGSYS (redirection enabled).

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes.
Instead, a userspace memory region exposed to the kernel indicates the
current personality, and the application simply modifies that variable
to configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any of the syscall ABI to do the filtering.  It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or
disable the mechanism globally for that thread.  When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
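The call can be wrapped in a small helper.  The following is a minimal
sketch rather than code taken from the kernel sources: the name sud_set
is hypothetical, and the fallback constants are assumed to mirror the
kernel UAPI prctl header for toolchains whose headers predate the
feature.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (assumed UAPI values). */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
# define PR_SET_SYSCALL_USER_DISPATCH 59
# define PR_SYS_DISPATCH_OFF 0
# define PR_SYS_DISPATCH_ON 1
#endif

/*
 * Hypothetical helper: configure Syscall User Dispatch for the calling
 * thread.  Returns 0 on success, or -1 with errno set.
 */
static int sud_set(unsigned long op, unsigned long offset,
                   unsigned long length, volatile char *selector)
{
    /* prctl() is variadic, so every argument is passed as unsigned long. */
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, op, offset, length,
                 (unsigned long)selector);
}
```

With this helper, sud_set(PR_SYS_DISPATCH_OFF, 0, 0, NULL) disables the
mechanism, while passing a nonzero offset, length or selector together
with PR_SYS_DISPATCH_OFF is expected to fail.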

[<offset>, <offset>+<length>) delimits a memory region from which
syscalls are always executed directly, regardless of the userspace
selector.  This provides a fast path for the C library, which includes
the most common syscall dispatchers in native code applications, and
also provides a way for the signal handler to return without triggering
a nested SIGSYS on (rt\_)sigreturn.  Users of this interface should make
sure that at least the signal trampoline code is included in this
region.  In addition, on architectures that implement the trampoline
code in the vDSO, that trampoline is never intercepted.
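The half-open interval test can be expressed as a single unsigned
comparison.  This is only a sketch of the containment rule described
above, assuming plain unsigned arithmetic; in_allowed_region is a
hypothetical name used for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the syscall instruction at ip lies in
 * [offset, offset + length).  When ip < offset the subtraction wraps
 * around to a very large value, so one comparison covers both bounds.
 */
static bool in_allowed_region(uintptr_t ip, uintptr_t offset, uintptr_t length)
{
    return ip - offset < length;
}
```

A length of zero makes the test false for every address, so no region
is exempt from the selector.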

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value should terminate the program with a SIGSYS.
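Putting the pieces together, the sketch below enables the mechanism with
an empty allowed region, flips the selector to block, issues one
syscall and counts the resulting SIGSYS.  It is an illustration only:
sud_demo, on_sigsys and the variable names are hypothetical, the
fallback constants are assumed to mirror the kernel UAPI headers, and
the code assumes dispatch is not already enabled for the calling thread.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed UAPI values). */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
# define PR_SET_SYSCALL_USER_DISPATCH 59
# define PR_SYS_DISPATCH_OFF 0
# define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
# define SYSCALL_DISPATCH_FILTER_ALLOW 0
# define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif
#ifndef SYS_USER_DISPATCH
# define SYS_USER_DISPATCH 2   /* si_code of a dispatched syscall */
#endif

/* The selector byte the kernel reads on every syscall entry. */
static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;
static volatile sig_atomic_t intercepted;

static void on_sigsys(int sig, siginfo_t *info, void *uctx)
{
    (void)sig;
    (void)uctx;
    /* Allow syscalls again first, so that sigreturn is not intercepted. */
    selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    if (info->si_code == SYS_USER_DISPATCH)
        intercepted++;
}

/* Returns the number of intercepted syscalls, or -1 if setup failed. */
int sud_demo(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSYS, &sa, &old) != 0)
        return -1;

    intercepted = 0;
    selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    /* offset = length = 0: no region is exempt from the selector. */
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, (unsigned long)PR_SYS_DISPATCH_ON,
              0UL, 0UL, (unsigned long)&selector) != 0) {
        sigaction(SIGSYS, &old, NULL);
        return -1;              /* kernel or architecture lacks support */
    }

    /* Flip the switch: the next syscall raises SIGSYS instead of running. */
    selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    syscall(SYS_getpid);

    /* The handler restored ALLOW, so these calls execute natively. */
    prctl(PR_SET_SYSCALL_USER_DISPATCH, (unsigned long)PR_SYS_DISPATCH_OFF,
          0UL, 0UL, 0UL);
    sigaction(SIGSYS, &old, NULL);
    return (int)intercepted;
}
```

Flipping the selector is a plain store to memory, which is what keeps
boundary crossings cheap.  A real compatibility layer would emulate the
intercepted call in the handler instead of merely counting it.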

Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests.  This is useful for checkpoint/restart software.

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.