.. _addsyscalls:

Adding a New System Call
========================

This document describes what's involved in adding a new system call to the
Linux kernel, over and above the normal submission advice in
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.


System Call Alternatives
------------------------

The first thing to consider when adding a new system call is whether one of
the alternatives might be suitable instead.  Although system calls are the
most traditional and most obvious interaction points between userspace and the
kernel, there are other possibilities -- choose what fits best for your
interface.

 - If the operations involved can be made to look like a filesystem-like
   object, it may make more sense to create a new filesystem or device.  This
   also makes it easier to encapsulate the new functionality in a kernel module
   rather than requiring it to be built into the main kernel.

     - If the new functionality involves operations where the kernel notifies
       userspace that something has happened, then returning a new file
       descriptor for the relevant object allows userspace to use
       ``poll``/``select``/``epoll`` to receive that notification.
     - However, operations that don't map to
       :manpage:`read(2)`/:manpage:`write(2)`-like operations
       have to be implemented as :manpage:`ioctl(2)` requests, which can lead
       to a somewhat opaque API.

 - If you're just exposing runtime system information, a new node in sysfs
   (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
   be more appropriate.  However, access to these mechanisms requires that the
   relevant filesystem is mounted, which might not always be the case (e.g.
   in a namespaced/sandboxed/chrooted environment).  Avoid adding any API to
   debugfs, as this is not considered a 'production' interface to userspace.
 - If the operation is specific to a particular file or file descriptor, then
   an additional :manpage:`fcntl(2)` command option may be more appropriate.  However,
   :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
   this option is best for when the new function is closely analogous to
   existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
   (for example, getting/setting a simple flag related to a file descriptor).
 - If the operation is specific to a particular task or process, then an
   additional :manpage:`prctl(2)` command option may be more appropriate.  As
   with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
   is best reserved for near-analogs of existing ``prctl()`` commands or
   getting/setting a simple flag related to a process.


Designing the API: Planning for Extension
-----------------------------------------

A new system call forms part of the API of the kernel, and has to be supported
indefinitely.  As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls --
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred
way to allow for future extensibility is to include a flags argument to the
system call.  To make sure that userspace programs can safely use flags
between kernel versions, check whether the flags value holds any unknown
flags, and reject the system call (with ``EINVAL``) if it does::

    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
        return -EINVAL;

(If no flags values are used yet, check that the flags argument is zero.)

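The flag check can be exercised outside the kernel.  The following is a
minimal userspace model, not kernel code: the ``THING_FLAG*`` values and the
``thing_check_flags()`` helper are hypothetical names chosen for illustration.

```c
#include <errno.h>

/* Hypothetical flag bits for an imaginary system call. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/*
 * Model of the kernel-side check: any bit outside the known mask is
 * rejected, so a later kernel can give that bit a meaning without
 * changing the behaviour seen by existing callers.
 */
int thing_check_flags(unsigned int flags)
{
    if (flags & ~THING_VALID_FLAGS)
        return -EINVAL;
    return 0;
}
```

Because unknown bits fail today, a program that sets a future flag on an old
kernel gets a clear ``EINVAL`` rather than silently different behaviour.
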
For more sophisticated system calls that involve a larger number of arguments,
it's preferred to encapsulate the majority of the arguments into a structure
that is passed in by pointer.  Such a structure can cope with future extension
by including a size argument in the structure::

    struct xyzzy_params {
        u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
        u32 param_1;
        u64 param_2;
        u64 param_3;
    };

As long as any subsequently added field, say ``param_4``, is designed so that a
zero value gives the previous behaviour, then this allows both directions of
version mismatch:

 - To cope with a later userspace program calling an older kernel, the kernel
   code should check that any memory beyond the size of the structure that it
   expects is zero (effectively checking that ``param_4 == 0``).
 - To cope with an older userspace program calling a newer kernel, the kernel
   code can zero-extend a smaller instance of the structure (effectively
   setting ``param_4 = 0``).

See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
``kernel/events/core.c``) for an example of this approach.


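The two directions of version mismatch can be sketched as a plain userspace
model.  This is an illustration only: ``xyzzy_copy_params()`` is a hypothetical
helper, ``memcpy()`` stands in for the kernel's user-memory accessors, and the
error codes are illustrative choices rather than a statement of what any
particular system call returns.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout from the text; fixed-width types stand in for u32/u64. */
struct xyzzy_params {
    uint32_t size;
    uint32_t param_1;
    uint64_t param_2;
    uint64_t param_3;
};

/*
 * 'src' and 'usize' play the role of the user pointer and the size that
 * userspace claims for its version of the structure.
 */
int xyzzy_copy_params(struct xyzzy_params *dst, const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t ksize = sizeof(*dst);
    size_t i;

    /* Too small to hold even the size field: malformed. */
    if (usize < sizeof(dst->size))
        return -EINVAL;

    /* Newer userspace: every byte this version does not know must be zero. */
    for (i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    /* Older userspace: copy what was supplied and zero-fill the rest. */
    memset(dst, 0, ksize);
    memcpy(dst, src, usize < ksize ? usize : ksize);
    return 0;
}
```

An 8-byte structure from an old program is accepted with ``param_2`` and
``param_3`` read as zero, while a larger structure is accepted only if its
extra bytes are all zero.
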
Designing the API: Other Considerations
---------------------------------------

If your new system call allows userspace to refer to a kernel object, it
should use a file descriptor as the handle for that object -- don't invent a
new type of userspace object handle when the kernel already has mechanisms and
well-defined semantics for using file descriptors.

If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
then the flags argument should include a value that is equivalent to setting
``O_CLOEXEC`` on the new FD.  This makes it possible for userspace to close
the timing window between ``xyzzy()`` and calling
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
``execve()`` in another thread could leak a descriptor to
the exec'ed program. (However, resist the temptation to re-use the actual value
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
numbering space of ``O_*`` flags that is fairly full.)

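From the userspace side, the benefit looks like the sketch below.  It uses the
existing :manpage:`eventfd(2)` call and its ``EFD_CLOEXEC`` flag as a stand-in
for a new call; the helper names are hypothetical.

```c
#include <fcntl.h>
#include <sys/eventfd.h>

/*
 * Create a descriptor that is close-on-exec from the moment it exists.
 * There is no window in which another thread's fork() + execve() could
 * inherit it, unlike creating it first and calling fcntl(F_SETFD) later.
 */
int make_cloexec_eventfd(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Report whether FD_CLOEXEC is set: 1 if set, 0 if clear, -1 on error. */
int fd_is_cloexec(int fd)
{
    int fdflags = fcntl(fd, F_GETFD);

    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```
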
If your system call returns a new file descriptor, you should also consider
what it means to use the :manpage:`poll(2)` family of system calls on that file
descriptor. Making a file descriptor ready for reading or writing is the
normal way for the kernel to indicate to userspace that an event has
occurred on the corresponding kernel object.

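As a concrete picture of "ready for reading means an event occurred", here is
a small userspace sketch.  It again borrows :manpage:`eventfd(2)` as the
kernel object; ``fd_readable_now()`` and ``post_event()`` are hypothetical
helper names.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if fd is readable right now, 0 if not, -1 on error. */
int fd_readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, 0);    /* timeout 0: do not block */

    if (ret < 0)
        return -1;
    return (ret == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Post one event on an eventfd by adding 1 to its counter. */
int post_event(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

A fresh eventfd with a zero counter is not readable; after ``post_event()`` it
becomes readable, which is the signal a ``poll()`` loop would wake up on.
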
If your new :manpage:`xyzzy(2)` system call involves a filename argument::

    int sys_xyzzy(const char __user *path, ..., unsigned int flags);

you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::

    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);

This allows more flexibility for how userspace specifies the file in question;
in particular it allows userspace to request the functionality for an
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
giving an :manpage:`fxyzzy(3)` operation for free::

 - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...)
 - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)

(For more details on the rationale of the \*at() calls, see the
:manpage:`openat(2)` man page; for an example of ``AT_EMPTY_PATH``, see the
:manpage:`fstatat(2)` man page.)

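The two equivalences can be seen with the existing :manpage:`fstatat(2)` call.
This is a userspace sketch with a hypothetical helper name,
``same_file_both_ways()``; the fallback definition of ``AT_EMPTY_PATH`` is
only used if the C library headers hide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for AT_EMPTY_PATH in <fcntl.h> */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000    /* Linux UAPI value, used if libc hid it */
#endif

/*
 * Stat 'path' twice: once by name relative to the current directory, and
 * once through an already-open descriptor with an empty path.  Returns 1
 * if both lookups reached the same inode, 0 if not, -1 on error.
 */
int same_file_both_ways(const char *path)
{
    struct stat by_name, by_fd;
    int fd, ret;

    /* Behaves like stat(path, ...). */
    if (fstatat(AT_FDCWD, path, &by_name, 0) != 0)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Behaves like fstat(fd, ...). */
    ret = fstatat(fd, "", &by_fd, AT_EMPTY_PATH);
    close(fd);
    if (ret != 0)
        return -1;

    return by_name.st_dev == by_fd.st_dev && by_name.st_ino == by_fd.st_ino;
}
```
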
If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
supported even on 32-bit architectures.

If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
it needs to be governed by the appropriate Linux capability bit (checked with
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
page.  Choose an existing capability bit that governs related functionality,
but try to avoid combining lots of only vaguely related functions together
under the same bit, as this goes against capabilities' purpose of splitting
the power of root.  In particular, avoid adding new uses of the already
overly-general ``CAP_SYS_ADMIN`` capability.

If your new :manpage:`xyzzy(2)` system call manipulates a process other than
the calling process, it should be restricted (using a call to
``ptrace_may_access()``) so that only a calling process with the same
permissions as the target process, or with the necessary capabilities, can
manipulate the target process.

Finally, be aware that some non-x86 architectures have an easier time if
system call parameters that are explicitly 64-bit fall on odd-numbered
arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
registers.  (This concern does not apply if the arguments are part of a
structure that's passed in by pointer.)


Proposing the API
-----------------

To make new system calls easy to review, it's best to divide up the patchset
into separate chunks.  These should include at least the following items as
distinct commits (each of which is described further below):

 - The core implementation of the system call, together with prototypes,
   generic numbering, Kconfig changes and fallback stub implementation.
 - Wiring up of the new system call for one particular architecture, usually
   x86 (including all of x86_64, x86_32 and x32).
 - A demonstration of the use of the new system call in userspace via a
   selftest in ``tools/testing/selftests/``.
 - A draft man-page for the new system call, either as plain text in the
   cover letter, or as a patch to the (separate) man-pages repository.

New system call proposals, like any change to the kernel's API, should always
be cc'ed to linux-api@vger.kernel.org.


Generic System Call Implementation
----------------------------------

The main entry point for your new :manpage:`xyzzy(2)` system call will be called
``sys_xyzzy()``, but you add this entry point with the appropriate
``SYSCALL_DEFINEn()`` macro rather than explicitly.  The 'n' indicates the
number of arguments to the system call, and the macro takes the system call name
followed by the (type, name) pairs for the parameters as arguments.  Using
this macro allows metadata about the new system call to be made available for
other tools.

The new entry point also needs a corresponding function prototype, in
``include/linux/syscalls.h``, marked as ``asmlinkage`` to match the way that
system calls are invoked::

    asmlinkage long sys_xyzzy(...);

Some architectures (e.g. x86) have their own architecture-specific syscall
tables, but several other architectures share a generic syscall table. Add your
new system call to the generic list by adding an entry to the list in
``include/uapi/asm-generic/unistd.h``::

    #define __NR_xyzzy 292
    __SYSCALL(__NR_xyzzy, sys_xyzzy)

Also update the ``__NR_syscalls`` count to reflect the additional system call,
and note that if multiple new system calls are added in the same merge window,
your new syscall number may get adjusted to resolve conflicts.

The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
system call, returning ``-ENOSYS``.  Add your new system call here too::

    COND_SYSCALL(xyzzy);

Your new kernel functionality, and the system call that controls it, should
normally be optional, so add a ``CONFIG`` option (typically to
``init/Kconfig``) for it. As usual for new ``CONFIG`` options:

 - Include a description of the new functionality and system call controlled
   by the option.
 - Make the option depend on ``EXPERT`` if it should be hidden from normal
   users.
 - Make any new source files implementing the function dependent on the
   ``CONFIG`` option in the Makefile (e.g.
   ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
 - Double check that the kernel still builds with the new ``CONFIG`` option
   turned off.

To summarize, you need a commit that includes:

 - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
 - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
 - corresponding prototype in ``include/linux/syscalls.h``
 - generic table entry in ``include/uapi/asm-generic/unistd.h``
 - fallback stub in ``kernel/sys_ni.c``


x86 System Call Implementation
------------------------------

To wire up your new system call for x86 platforms, you need to update the
master syscall tables.  Assuming your new system call isn't special in some
way (see below), this involves a "common" entry (for x86_64 and x32) in
``arch/x86/entry/syscalls/syscall_64.tbl``::

    333   common   xyzzy     sys_xyzzy

and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy

Again, these numbers are liable to be changed if there are conflicts in the
relevant merge window.


Compatibility System Calls (Generic)
------------------------------------

For most system calls the same 64-bit implementation can be invoked even when
the userspace program is itself 32-bit; even if the system call's parameters
include an explicit pointer, this is handled transparently.

However, there are a couple of situations where a compatibility layer is
needed to cope with size differences between 32-bit and 64-bit.

The first is if the 64-bit kernel also supports 32-bit userspace programs, and
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
64-bit values.  In particular, this is needed whenever a system call argument
is:

 - a pointer to a pointer
 - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
 - a pointer to a varying sized integral type (``time_t``, ``off_t``,
   ``long``, ...)
 - a pointer to a struct containing a varying sized integral type.

The second situation that requires a compatibility layer is if one of the
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
architecture, for example ``loff_t`` or ``__u64``.  In this case, a value that
arrives at a 64-bit kernel from a 32-bit application will be split into two
32-bit values, which then need to be re-assembled in the compatibility layer.

(Note that a system call argument that's a pointer to an explicit 64-bit type
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)

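The re-assembly step for a split 64-bit argument amounts to a shift and an OR.
The sketch below is a userspace illustration with a hypothetical helper name,
``join_u64()``; it assumes the caller already knows which register carried the
low half, since that depends on the architecture's ABI and endianness.

```c
#include <stdint.h>

/*
 * Rebuild a 64-bit argument from the two 32-bit halves that a 32-bit
 * caller passed in separate registers.
 */
uint64_t join_u64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

The cast before the shift matters: shifting a 32-bit value left by 32 bits
would be undefined behaviour in C.
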
The compatibility version of the system call is called ``compat_sys_xyzzy()``,
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
``SYSCALL_DEFINEn()``.  This version of the implementation runs as part of a
64-bit kernel, but expects to receive 32-bit parameter values and does whatever
is needed to deal with them.  (Typically, the ``compat_sys_`` version converts
the values to 64-bit versions and either calls on to the ``sys_`` version, or
both of them call a common inner implementation function.)

The compat entry point also needs a corresponding function prototype, in
``include/linux/compat.h``, marked as ``asmlinkage`` to match the way that
system calls are invoked::

    asmlinkage long compat_sys_xyzzy(...);

If the system call involves a structure that is laid out differently on 32-bit
and 64-bit systems, say ``struct xyzzy_args``, then the
``include/linux/compat.h`` header file should also include a compat version of
the structure (``struct compat_xyzzy_args``) where each variable-size field has
the appropriate ``compat_`` type that corresponds to the type in
``struct xyzzy_args``.  The ``compat_sys_xyzzy()`` routine can then use this
``compat_`` structure to parse the arguments from a 32-bit invocation.

For example, if there are fields::

    struct xyzzy_args {
        const char __user *ptr;
        __kernel_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

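What "parsing the arguments from a 32-bit invocation" involves can be modelled
in userspace as a field-by-field widening.  Everything prefixed ``model_``
below is hypothetical and only stands in for the kernel types of the same
flavour; in the kernel itself the pointer field would be converted with
``compat_ptr()`` rather than kept as an integer.

```c
#include <stdint.h>

/* Stand-ins for the kernel types, with widths chosen for an LP64 kernel. */
typedef uint32_t model_compat_uptr_t;    /* 32-bit user pointer */
typedef int32_t  model_compat_long_t;    /* 32-bit 'long' */

struct model_compat_xyzzy_args {
    model_compat_uptr_t ptr;
    model_compat_long_t varying_val;
    uint64_t fixed_val;
};

struct model_xyzzy_args {
    uint64_t ptr;            /* user address, kept as an integer here */
    int64_t  varying_val;
    uint64_t fixed_val;
};

/* Widen each field: addresses zero-extend, signed longs sign-extend. */
void model_xyzzy_from_compat(struct model_xyzzy_args *dst,
                             const struct model_compat_xyzzy_args *src)
{
    dst->ptr = src->ptr;
    dst->varying_val = src->varying_val;
    dst->fixed_val = src->fixed_val;
}
```

The fixed-width ``fixed_val`` field is copied unchanged, which is why only the
variable-size fields need ``compat_`` types.
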
The generic system call list also needs adjusting to allow for the compat
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
``__SC_COMP`` rather than ``__SYSCALL``::

    #define __NR_xyzzy 292
    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)

To summarize, you need:

 - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
 - corresponding prototype in ``include/linux/compat.h``
 - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
 - instance of ``__SC_COMP`` not ``__SYSCALL`` in
   ``include/uapi/asm-generic/unistd.h``


Compatibility System Calls (x86)
--------------------------------

To wire up the x86 architecture of a system call with a compatibility version,
the entries in the syscall tables need to be adjusted.

First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
column to indicate that a 32-bit userspace program running on a 64-bit kernel
should hit the compat entry point::

    380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy

Second, you need to figure out what should happen for the x32 ABI version of
the new system call.  There's a choice here: the layout of the arguments
should either match the 64-bit version or the 32-bit version.

If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
ILP32, so the layout should match the 32-bit version, and the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
the compatibility wrapper::

    333   64       xyzzy     sys_xyzzy
    ...
    555   x32      xyzzy     __x32_compat_sys_xyzzy

If no pointers are involved, then it is preferable to re-use the 64-bit system
call for the x32 ABI (and consequently the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).

In either case, you should check that the types involved in your argument
layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or
64-bit (-m64) equivalents.


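One way to check that a layout maps exactly between ABIs is to pin it down
with compile-time assertions and build the same file with ``-m64``, ``-m32``
and ``-mx32``.  The structure below is a hypothetical example using only
fixed-width fields; the expected offsets are assumptions specific to this
example layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block built only from fixed-width types. */
struct xyzzy_layout {
    uint32_t size;
    uint32_t param_1;
    uint64_t param_2;
    uint64_t param_3;
};

/* The build fails on any ABI where the layout differs from these values. */
_Static_assert(offsetof(struct xyzzy_layout, param_1) == 4,
               "param_1 offset differs on this ABI");
_Static_assert(offsetof(struct xyzzy_layout, param_2) == 8,
               "param_2 offset differs on this ABI");
_Static_assert(offsetof(struct xyzzy_layout, param_3) == 16,
               "param_3 offset differs on this ABI");
_Static_assert(sizeof(struct xyzzy_layout) == 24,
               "structure size differs on this ABI");

/* Runtime accessor so a test program can also report the size. */
size_t xyzzy_layout_size(void)
{
    return sizeof(struct xyzzy_layout);
}
```

A structure containing ``long`` or a pointer would fail such a check on at
least one of the three ABIs, which is the signal that a compat layout is
needed.
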
System Calls Returning Elsewhere
--------------------------------

For most system calls, once the system call is complete the user program
continues exactly where it left off -- at the next instruction, with the
stack the same and most of the registers the same as before the system call,
and with the same virtual memory space.

However, a few system calls do things differently.  They might return to a
different location (``rt_sigreturn``) or change the memory space
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
of the program.

To allow for this, the kernel implementation of the system call may need to
save and restore additional registers to the kernel stack, allowing complete
control of where and how execution continues after the system call.

This is arch-specific, but typically involves defining assembly entry points
that save/restore additional registers and invoke the real system call entry
point.

For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::

    333   common   xyzzy     stub_xyzzy

The equivalent for 32-bit programs running on a 64-bit kernel is normally
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
with the corresponding syscall table adjustment in
``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy    stub32_xyzzy

If the system call needs a compatibility layer (as in the previous section)
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
of the system call rather than the native 64-bit version.  Also, if the x32 ABI
implementation is not common with the x86_64 version, then its syscall
table will also need to invoke a stub that calls on to the ``compat_sys_``
version.

For completeness, it's also nice to set up a mapping so that user-mode Linux
still works -- its syscall table will reference ``stub_xyzzy``, but the UML
build doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because
UML simulates registers etc).  Fixing this is as simple as adding a ``#define``
to ``arch/x86/um/sys_call_table_64.c``::

    #define stub_xyzzy sys_xyzzy


44062306a36Sopenharmony_ciOther Details
44162306a36Sopenharmony_ci-------------
44262306a36Sopenharmony_ci
44362306a36Sopenharmony_ciMost of the kernel treats system calls in a generic way, but there is the
44462306a36Sopenharmony_cioccasional exception that may need updating for your particular system call.
44562306a36Sopenharmony_ci
The audit subsystem is one such special case; it includes (arch-specific)
functions that classify some special types of system call -- specifically
file open (``open``/``openat``), program execution (``execve``/``execveat``) or
socket multiplexor (``socketcall``) operations. If your new system call is
analogous to one of these, then the audit system should be updated.

More generally, if there is an existing system call that is analogous to your
new system call, it's worth doing a kernel-wide grep for the existing system
call to check there are no other special cases.


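The audit classification mentioned above can be pictured as a switch over
system call numbers.  The following is only a userspace sketch of that idea,
not the kernel's audit code: the ``enum sc_class`` labels and the
``classify_syscall()`` name are assumptions made for illustration, and
``socketcall`` is left out because it exists only on some architectures.

.. code-block:: c

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>   /* SYS_* numbers for the build architecture */

    /* Hypothetical class labels; the kernel's own constants may differ. */
    enum sc_class { SC_OTHER, SC_OPEN, SC_EXEC };

    /* Map a system call number to the special class it belongs to, if any. */
    static enum sc_class classify_syscall(long nr)
    {
        switch (nr) {
    #ifdef SYS_open
        case SYS_open:          /* not present on every architecture */
    #endif
        case SYS_openat:
            return SC_OPEN;
        case SYS_execve:
    #ifdef SYS_execveat
        case SYS_execveat:
    #endif
            return SC_EXEC;
        default:
            return SC_OTHER;
        }
    }

    int main(void)
    {
        printf("openat=%d execve=%d getpid=%d\n",
               classify_syscall(SYS_openat),
               classify_syscall(SYS_execve),
               classify_syscall(SYS_getpid));
        return 0;
    }

A new system call that behaves like ``openat`` would need its own ``case``
label in a classifier of this shape; otherwise it silently falls through to
the default class.
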
Testing
-------

A new system call should obviously be tested; it is also useful to provide
reviewers with a demonstration of how user space programs will use the system
call.  A good way to combine these aims is to include a simple self-test
program in a new directory under ``tools/testing/selftests/``.

For a new system call, there will be no libc wrapper function yet, so the test
will need to invoke it using ``syscall()``; also, if the system call involves a
new userspace-visible structure, the corresponding header will need to be
installed to compile the test.

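As a rough sketch of that invocation pattern, the test below calls a system
call by number through ``syscall()`` instead of through a libc wrapper.
Because ``xyzzy`` is hypothetical and has no number, the sketch uses
``SYS_getpid`` as a stand-in; a real selftest would pass the new
``__NR_xyzzy`` value from the installed headers.  The helper name
``raw_getpid()`` is an assumption made for illustration.

.. code-block:: c

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>   /* SYS_* numbers */
    #include <unistd.h>        /* syscall(), getpid() */

    /* Invoke the system call directly by number, bypassing the libc wrapper. */
    static long raw_getpid(void)
    {
        return syscall(SYS_getpid);
    }

    int main(void)
    {
        long pid = raw_getpid();

        if (pid < 0) {
            perror("syscall");
            return 1;
        }
        /* Cross-check against the libc wrapper of the stand-in call. */
        if (pid != (long)getpid()) {
            fprintf(stderr, "mismatch: raw=%ld libc=%ld\n",
                    pid, (long)getpid());
            return 1;
        }
        printf("ok\n");
        return 0;
    }

On failure ``syscall()`` returns -1 and sets ``errno``, so a test for a new
system call can check both the return value and the expected error code.
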
Make sure the selftest runs successfully on all supported architectures.  For
example, check that it works when compiled as an x86_64 (``-m64``), x86_32
(``-m32``) and x32 (``-mx32``) ABI program.

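When running the same test under several ABIs, it can help for the test to
report which ABI it was actually built for.  The sketch below does so using
compiler-predefined macros; it assumes a GCC-compatible compiler, and the
``build_abi()`` helper is a hypothetical name rather than part of the
selftest framework.

.. code-block:: c

    #include <stdio.h>

    /* Report the x86 ABI this binary was compiled for. */
    static const char *build_abi(void)
    {
    #if defined(__x86_64__) && defined(__ILP32__)
        return "x32";       /* 64-bit instruction set, 32-bit pointers */
    #elif defined(__x86_64__)
        return "x86_64";
    #elif defined(__i386__)
        return "i386";
    #else
        return "other";     /* not an x86 build */
    #endif
    }

    int main(void)
    {
        printf("abi=%s sizeof(long)=%zu sizeof(void *)=%zu\n",
               build_abi(), sizeof(long), sizeof(void *));
        return 0;
    }

Building this file once per flag and comparing the printed sizes is a quick
way to confirm that each variant of the selftest really exercises a different
ABI.
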
For more extensive and thorough testing of new functionality, you should also
consider adding tests to the Linux Test Project, or to the xfstests project
for filesystem-related changes.

 - https://linux-test-project.github.io/
 - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


Man Page
--------

All new system calls should come with a complete man page, ideally using groff
markup, but plain text will do.  If groff is used, it's helpful to include a
pre-rendered ASCII version of the man page in the cover email for the
patchset, for the convenience of reviewers.

The man page should be cc'ed to linux-man@vger.kernel.org.
For more details, see https://www.kernel.org/doc/man-pages/patches.html


Do Not Call System Calls in the Kernel
--------------------------------------

System calls are, as stated above, interaction points between userspace and
the kernel.  Therefore, system call functions such as ``sys_xyzzy()`` or
``compat_sys_xyzzy()`` should only be called from userspace via the syscall
table, and not from elsewhere in the kernel.  If the syscall functionality is
useful within the kernel, needs to be shared between an old and a new syscall,
or needs to be shared between a syscall and its compatibility variant, it
should be implemented by means of a "helper" function (such as
``ksys_xyzzy()``).  This kernel function may then be called within the
syscall stub (``sys_xyzzy()``), the compatibility syscall stub
(``compat_sys_xyzzy()``), and/or other kernel code.

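The helper arrangement can be sketched in plain C.  This is a userspace
illustration only: real kernel code would define the entry points with the
syscall definition macros rather than as ordinary functions, and the argument
list (a file descriptor plus a 64-bit length that the compat variant receives
as two 32-bit halves) is a hypothetical choice for ``xyzzy``.

.. code-block:: c

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Shared helper: holds the real logic; other kernel code may call it. */
    static long ksys_xyzzy(int fd, uint64_t len)
    {
        if (fd < 0)
            return -EBADF;
        return (long)(len / 4096);  /* pretend result: count of 4 KiB pages */
    }

    /* Native entry point: a thin stub around the helper. */
    static long sys_xyzzy(int fd, unsigned long len)
    {
        return ksys_xyzzy(fd, len);
    }

    /* Compat entry point: rebuilds the 64-bit value, then calls the helper. */
    static long compat_sys_xyzzy(int fd, uint32_t len_lo, uint32_t len_hi)
    {
        return ksys_xyzzy(fd, ((uint64_t)len_hi << 32) | len_lo);
    }

    int main(void)
    {
        printf("%ld %ld %ld\n", sys_xyzzy(3, 8192),
               compat_sys_xyzzy(3, 0, 1), sys_xyzzy(-1, 8192));
        return 0;
    }

Neither stub calls the other; both funnel into ``ksys_xyzzy()``, which is the
only function that in-kernel callers would use.
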
At least on 64-bit x86, not calling system call functions in the kernel has
been a hard requirement since v4.17.  That architecture uses a different
calling convention for system calls, where ``struct pt_regs`` is decoded
on-the-fly in a syscall wrapper which then hands processing over to the actual
syscall function.  This means that only those parameters which are actually
needed for a specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the time
(which may cause serious trouble down the call chain).

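The wrapper idea can be modelled in userspace as follows.  This is a sketch
under stated assumptions, not the kernel's generated code: ``struct
fake_regs`` is a cut-down stand-in for ``struct pt_regs`` holding only the six
x86-64 argument registers, and ``wrap_xyzzy()`` and ``do_xyzzy()`` are
hypothetical names for the wrapper and the actual syscall function.

.. code-block:: c

    #include <stdio.h>

    /* Cut-down stand-in for struct pt_regs: the six argument registers. */
    struct fake_regs {
        unsigned long di, si, dx, r10, r8, r9;
    };

    /* The actual syscall function takes only the two arguments it needs. */
    static long do_xyzzy(int fd, unsigned long len)
    {
        return fd >= 0 ? (long)len : -1;
    }

    /* Wrapper: decode just the needed registers and ignore the other four. */
    static long wrap_xyzzy(const struct fake_regs *regs)
    {
        return do_xyzzy((int)regs->di, regs->si);
    }

    int main(void)
    {
        /* dx..r9 hold leftover user values that never reach do_xyzzy(). */
        struct fake_regs regs = { 3, 4096, 0xdead, 0xbeef, 0xf00d, 0xcafe };

        printf("%ld\n", wrap_xyzzy(&regs));
        return 0;
    }

Code elsewhere in the kernel has no register snapshot to hand to such a
wrapper, which is one practical reason to call a helper function instead.
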
Moreover, the rules on how data may be accessed may differ between kernel data
and user data.  This is another reason why calling ``sys_xyzzy()`` is generally
a bad idea.

Exceptions to this rule are only allowed in architecture-specific overrides,
architecture-specific compatibility wrappers, or other code in ``arch/``.


References and Sources
----------------------

 - LWN article from Michael Kerrisk on use of flags argument in system calls:
   https://lwn.net/Articles/585415/
 - LWN article from Michael Kerrisk on how to handle unknown flags in a system
   call: https://lwn.net/Articles/588444/
 - LWN article from Jake Edge describing constraints on 64-bit system call
   arguments: https://lwn.net/Articles/311630/
 - Pair of LWN articles from David Drysdale that describe the system call
   implementation paths in detail for v3.14:

    - https://lwn.net/Articles/604287/
    - https://lwn.net/Articles/604515/

 - Architecture-specific requirements for system calls are discussed in the
   :manpage:`syscall(2)` man-page:
   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
 - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
   https://yarchive.net/comp/linux/ioctl.html
 - "How to not invent kernel interfaces", Arnd Bergmann,
   https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
 - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
   https://lwn.net/Articles/486306/
 - Recommendation from Andrew Morton that all related information for a new
   system call should come in the same email thread:
   https://lore.kernel.org/r/20140724144747.3041b208832bbdf9fbce5d96@linux-foundation.org
 - Recommendation from Michael Kerrisk that a new system call should come with
   a man page: https://lore.kernel.org/r/CAKgNAkgMA39AfoSoA5Pe1r9N+ZzfYQNvNPvcRN7tOvRb8+v06Q@mail.gmail.com
 - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
   commit: https://lore.kernel.org/r/alpine.DEB.2.11.1411191249560.3909@nanos
 - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
   come with a man-page & selftest: https://lore.kernel.org/r/20140320025530.GA25469@kroah.com
 - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
   https://lore.kernel.org/r/CAHO5Pa3F2MjfTtfNxa8LbnkeeU8=YJ+9tDqxZpw7Gz59E-4AUg@mail.gmail.com
 - Suggestion from Ingo Molnar that system calls that involve multiple
   arguments should encapsulate those arguments in a struct, which includes a
   size field for future extensibility: https://lore.kernel.org/r/20150730083831.GA22182@gmail.com
 - Numbering oddities arising from (re-)use of O_* numbering space flags:

    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
      check")
    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
      conflict")
    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")

 - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
   https://lore.kernel.org/r/20081212152929.GM26095@parisc-linux.org
 - Recommendation from Greg Kroah-Hartman that unknown flags should be
   policed: https://lore.kernel.org/r/20140717193330.GB4703@kroah.com
 - Recommendation from Linus Torvalds that x32 system calls should prefer
   compatibility with 64-bit versions rather than 32-bit versions:
   https://lore.kernel.org/r/CA+55aFxfmwfB7jbbrXxa=K7VBYPfAvmu3XOkGrLbB1UFjX1+Ew@mail.gmail.com