.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process. Seccomp
falls short on this task, since it has limited support to efficiently
filter syscalls based on memory regions, and it doesn't support removing
filters. Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace. The application is in control of a flip
switch, indicating the current personality of the process. A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crosses, which is achieved by not executing a syscall to change
personality every time the compatibility layer executes.
Instead, a userspace memory region exposed to the kernel indicates the
current personality, and the application simply modifies that variable to
configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux. Syscall User Dispatch therefore
doesn't rely on any of the syscall ABI to make the filtering. It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread.
When PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

[<offset>, <offset>+<length>) delimits a memory region
from which syscalls are always executed directly, regardless of the
userspace selector. This provides a fast path for the C library, which
includes the most common syscall dispatchers in the native code
applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
interface should make sure that at least the signal trampoline code is
included in this region. In addition, on architectures that implement the
trampoline code in the vDSO, that trampoline is never intercepted.

[selector] is a pointer to a char-sized region in the process memory,
which provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly. selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value terminates the program with a SIGSYS.

Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests. This is useful for checkpoint/restart software.

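
As a minimal sketch of the prctl interface described above, the helpers
below enable and disable the mechanism and flip the selector byte. The
names ``sud_selector``, ``sud_enable``, ``sud_disable``,
``sud_enter_foreign`` and ``sud_leave_foreign`` are hypothetical and not
part of any kernel or libc API. The fallback ``#define`` values are
assumed to match ``linux/prctl.h`` and are only used when the installed
headers predate the feature. A single global selector is used for
brevity; a multi-threaded program would likely want one selector per
thread.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (assumed to match linux/prctl.h). */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical selector byte. It must stay mapped while dispatch is on. */
volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/*
 * Enable the mechanism for the calling thread. Syscalls issued from
 * [offset, offset + length) are always executed directly.
 * Returns 0 on success, -1 with errno set (e.g. on unsupported kernels).
 */
int sud_enable(unsigned long offset, unsigned long length)
{
    /* Start in the "native" personality so nothing is redirected yet. */
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)&sud_selector);
}

/* Disable the mechanism; all remaining arguments must be zero. */
int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}

/* Flip the switch without entering the kernel. */
void sud_enter_foreign(void)
{
    /* Syscalls outside the allowed region now raise SIGSYS. */
    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
}

void sud_leave_foreign(void)
{
    /* Syscalls are executed directly again. */
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
}
```

A caller would typically install a SIGSYS handler before calling
``sud_enter_foreign()``, and either pass the address range of its signal
trampoline as ``offset`` and ``length`` or set the selector back to
SYSCALL_DISPATCH_FILTER_ALLOW inside the handler before it returns, so
that (rt\_)sigreturn is not intercepted.
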

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process. It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value. If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.
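
To illustrate why neither the allowed region nor the selector offers any
protection, the following is a userspace model of the filtering decision
as this document describes it. It is a sketch, not the kernel
implementation: the type and function names (``sud_config``,
``sud_outcome``, ``sud_model_decide``) and the ``MODEL_FILTER_*``
constants are hypothetical, the vDSO trampoline exemption is left out,
and treating a missing selector as "always redirect" is an assumption
about the optional [selector] argument.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for SYSCALL_DISPATCH_FILTER_ALLOW/BLOCK. */
#define MODEL_FILTER_ALLOW 0
#define MODEL_FILTER_BLOCK 1

enum sud_outcome {
    SUD_EXECUTE,          /* syscall runs directly */
    SUD_DISPATCH_SIGSYS,  /* syscall is redirected to the SIGSYS handler */
    SUD_KILL_SIGSYS       /* invalid selector value: program is terminated */
};

struct sud_config {
    uintptr_t offset;      /* start of the always-allowed region */
    uintptr_t length;      /* size of the always-allowed region */
    const char *selector;  /* userspace switch, may be NULL (assumption) */
};

/* Decide what happens to a syscall issued from instruction address ip. */
enum sud_outcome sud_model_decide(const struct sud_config *cfg, uintptr_t ip)
{
    /*
     * Unsigned subtraction wraps when ip < offset, so this single
     * comparison tests offset <= ip < offset + length.
     */
    if (ip - cfg->offset < cfg->length)
        return SUD_EXECUTE;

    if (cfg->selector != NULL) {
        char state = *cfg->selector;

        if (state == MODEL_FILTER_ALLOW)
            return SUD_EXECUTE;
        if (state != MODEL_FILTER_BLOCK)
            return SUD_KILL_SIGSYS;
    }

    return SUD_DISPATCH_SIGSYS;
}
```

Both inputs to the decision, the address the syscall is issued from and
the byte behind ``selector``, are under the control of the application
itself, which is why jumping into the allowed region or rewriting the
selector is enough to bypass the redirection.
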