unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
	1) Overview
	2) Benefits
	3) Cost
	4) Requirements
	5) Functional Specification
	6) High Level Design
	7) Low Level Design
	8) Test Specification
	9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.
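
As a rough illustration of this selective sharing, the sketch below uses
the glibc clone() wrapper to create a child that either shares the file
descriptor table with its parent or does not. The helper names
close_fd() and child_close_visible(), the stack size and the use of
/dev/null are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size chosen for the sketch */

/* Child entry point: close the descriptor whose number is passed in. */
static int close_fd(void *arg)
{
	return close(*(int *)arg) == 0 ? 0 : 1;
}

/*
 * Create a child with the given sharing flags and let it close a
 * descriptor. Returns 1 if the parent's descriptor was closed as well
 * (the table was shared), 0 if it is still open, and -1 on error.
 */
static int child_close_visible(int share_flags)
{
	int fd = open("/dev/null", O_RDONLY);
	char *stack = malloc(STACK_SIZE);
	int visible = -1;
	pid_t pid;

	if (fd >= 0 && stack != NULL) {
		/* The stack grows down on most architectures: pass its top. */
		pid = clone(close_fd, stack + STACK_SIZE,
			    share_flags | SIGCHLD, &fd);
		if (pid > 0 && waitpid(pid, NULL, 0) == pid)
			visible = (fcntl(fd, F_GETFD) == -1);
	}
	if (fd >= 0 && visible != 1)
		close(fd);	/* still open in the parent: clean up */
	free(stack);
	return visible;
}
```

Calling child_close_visible(CLONE_FILES) shares the table, so the
child's close() affects the parent; calling it with 0 gives the child
its own copy of the table.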

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile; however, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
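
A minimal sketch of what such a login-time setup might look like is
shown below. It assumes the caller has CAP_SYS_ADMIN and that a per-user
instance directory already exists. The function name
setup_private_tmp() and the choice to mark the mount tree private before
binding are assumptions for illustration, not a description of any
existing PAM module.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Give the calling process a private mount namespace and bind the
 * per-user directory instance_dir over /tmp. Returns 0 on success and
 * -1 with errno set on failure.
 */
static int setup_private_tmp(const char *instance_dir)
{
	/* Detach from the mount namespace inherited from the parent. */
	if (unshare(CLONE_NEWNS) == -1)
		return -1;

	/* Assumption: stop later mounts from propagating elsewhere. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
		return -1;

	/* Only this process and its descendants see the new /tmp. */
	return mount(instance_dir, "/tmp", NULL, MS_BIND, NULL);
}
```

A caller would invoke setup_private_tmp() with the path of the user's
instance directory before dropping privileges.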

2.2 Unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.
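
The sketch below illustrates the idea with a worker that starts out
sharing the server's file descriptor table and detaches from it part-way
through. The names worker() and worker_close_isolated() and the exit
code convention are hypothetical, and the outcome depends on the process
being permitted to call unshare().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size chosen for the sketch */

/* Worker: get a private copy of the descriptor table, then close fd. */
static int worker(void *arg)
{
	if (unshare(CLONE_FILES) == -1)
		return 2;	/* refused: leave the shared table alone */
	return close(*(int *)arg) == 0 ? 0 : 1;
}

/*
 * Returns 1 if the server's descriptor survived the worker's close(),
 * 0 if it did not, and -1 if the experiment could not be run.
 */
static int worker_close_isolated(void)
{
	int fd = open("/dev/null", O_RDONLY);
	char *stack = malloc(STACK_SIZE);
	int status, result = -1;
	pid_t pid;

	if (fd >= 0 && stack != NULL) {
		/* The worker initially shares the descriptor table. */
		pid = clone(worker, stack + STACK_SIZE,
			    CLONE_FILES | SIGCHLD, &fd);
		if (pid > 0 && waitpid(pid, &status, 0) == pid &&
		    WIFEXITED(status) && WEXITSTATUS(status) == 0)
			result = (fcntl(fd, F_GETFD) != -1);
	}
	if (fd >= 0 && result != 0)
		close(fd);	/* still open in the server: clean up */
	free(stack);
	return result;
}
```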

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions used by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have an interface similar to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------

NAME
	unshare - disassociate parts of the process execution context

SYNOPSIS
	#include <sched.h>

	int unshare(int flags);

DESCRIPTION
	unshare() allows a process to disassociate parts of its execution
	context that are currently being shared with other processes. Part
	of the execution context, such as the namespace, is shared by default
	when a new process is created using fork(2), while other parts,
	such as the virtual memory, open file descriptors, etc, may be
	shared by explicit request to share them when creating a process
	using clone(2).

	The main use of unshare() is to allow a process to control its
	shared execution context without creating a new process.

	The flags argument specifies one of the following constants, or
	the bitwise OR of several of them.

	CLONE_FS
		If CLONE_FS is set, file system information of the caller
		is disassociated from the shared file system information.

	CLONE_FILES
		If CLONE_FILES is set, the file descriptor table of the
		caller is disassociated from the shared file descriptor
		table.

	CLONE_NEWNS
		If CLONE_NEWNS is set, the namespace of the caller is
		disassociated from the shared namespace.

	CLONE_VM
		If CLONE_VM is set, the virtual memory of the caller is
		disassociated from the shared virtual memory.

RETURN VALUE
	On success, zero is returned. On failure, -1 is returned and errno
	is set to indicate the error.

ERRORS
	EPERM	CLONE_NEWNS was specified by a non-root process (process
		without CAP_SYS_ADMIN).

	ENOMEM	Cannot allocate sufficient memory to copy parts of caller's
		context that need to be unshared.

	EINVAL	Invalid flag was specified as an argument.

CONFORMING TO
	The unshare() call is Linux-specific and should not be used
	in programs intended to be portable.

SEE ALSO
	clone(2), fork(2)
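
The following sketch shows one way a caller might invoke unshare() and
interpret the errors listed above. The helper names
unshare_error_hint() and unshare_or_report() and the message texts are
illustrative assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Map the documented errno values to a short explanation. */
static const char *unshare_error_hint(int err)
{
	switch (err) {
	case EPERM:
		return "CLONE_NEWNS requires CAP_SYS_ADMIN";
	case ENOMEM:
		return "not enough memory to copy the unshared context";
	case EINVAL:
		return "invalid flag";
	default:
		return strerror(err);
	}
}

/* Call unshare(); return 0 on success or the negated errno on failure. */
static int unshare_or_report(int flags)
{
	int err;

	if (unshare(flags) == 0)
		return 0;
	err = errno;	/* save errno before calling into stdio */
	fprintf(stderr, "unshare(%#x): %s\n", (unsigned int)flags,
		unshare_error_hint(err));
	return -err;
}
```

For example, unshare_or_report(CLONE_FS | CLONE_FILES) asks for a
private copy of both the filesystem information and the file descriptor
table in a single call.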

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared versions, associates the newly duplicated structures
with the current task structure and releases the corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     a new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call, on
the other hand, performs the following:

  1) Check flags to force missing, but implied, flags

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
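
The ordering of steps 2 to 4 can be modelled in user space. The sketch
below is not kernel code: struct ctx, struct task, dup_ctx(), put_ctx(),
model_unshare() and the UNSHARE_* bits are hypothetical stand-ins, the
failure-injection counter exists only to exercise the error path, and a
comment marks where the kernel would hold task_lock().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define UNSHARE_FS	0x1	/* hypothetical flag bits for the model */
#define UNSHARE_FILES	0x2

struct ctx { int refcount; int value; };
struct task { struct ctx *fs; struct ctx *files; };

static int allocs_before_failure = -1;	/* -1: never fail */

/* Allocate a private copy of a shared context structure. */
static struct ctx *dup_ctx(const struct ctx *old)
{
	struct ctx *new;

	if (allocs_before_failure == 0)
		return NULL;	/* injected allocation failure */
	if (allocs_before_failure > 0)
		allocs_before_failure--;
	new = malloc(sizeof(*new));
	if (new != NULL) {
		new->refcount = 1;
		new->value = old->value;
	}
	return new;
}

/* Drop one reference and free the structure when it was the last. */
static void put_ctx(struct ctx *c)
{
	if (c != NULL && --c->refcount == 0)
		free(c);
}

static int model_unshare(struct task *t, int flags)
{
	struct ctx *new_fs = NULL, *new_files = NULL, *old;

	/* Step 2: duplicate everything first; the task is not touched. */
	if ((flags & UNSHARE_FS) && t->fs->refcount > 1 &&
	    (new_fs = dup_ctx(t->fs)) == NULL)
		return -ENOMEM;
	if ((flags & UNSHARE_FILES) && t->files->refcount > 1 &&
	    (new_files = dup_ctx(t->files)) == NULL) {
		free(new_fs);	/* back out without touching the task */
		return -ENOMEM;
	}

	/* Step 3: the kernel would hold task_lock() around these swaps. */
	if (new_fs != NULL) {
		old = t->fs;
		t->fs = new_fs;
		new_fs = old;
	}
	if (new_files != NULL) {
		old = t->files;
		t->files = new_files;
		new_files = old;
	}

	/* Step 4: release the references to the older, shared structures. */
	put_ctx(new_fs);
	put_ctx(new_files);
	return 0;
}
```

Because nothing is attached to the task until every duplication has
succeeded, the error path only has to free private copies and never
needs to restore a shared structure.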

7) Low Level Design
-------------------

Implementation of unshare() can be grouped into the following four
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
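
The split can be pictured with the following user-space model. It is
only a sketch: struct files_model, struct task_model and the *_model
functions are hypothetical names that mirror the roles of dup_* and
copy_*, not the kernel's actual definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FDS 8	/* arbitrary table size for the model */

struct files_model { int refcount; int fds[MAX_FDS]; };
struct task_model { struct files_model *files; };

/* First component: allocate and duplicate, without touching any task. */
static struct files_model *dup_files_model(const struct files_model *old)
{
	struct files_model *new = malloc(sizeof(*new));

	if (new != NULL) {
		memcpy(new->fds, old->fds, sizeof(new->fds));
		new->refcount = 1;
	}
	return new;
}

/*
 * Second component: link the structure to the task being constructed.
 * The child either shares the parent's table or gets a duplicate of it.
 */
static int copy_files_model(int share, const struct task_model *parent,
			    struct task_model *child)
{
	if (share) {
		parent->files->refcount++;
		child->files = parent->files;
		return 0;
	}
	child->files = dup_files_model(parent->files);
	return child->files != NULL ? 0 : -ENOMEM;
}
```

Because dup_files_model() knows nothing about tasks, an unshare-style
caller can reuse it on the current task and attach the result later.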

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       * Check flags
	 Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
	 If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
	 set and signals are also being shared, force CLONE_THREAD. If
	 CLONE_NEWNS is set, force CLONE_FS.

       * For each context flag, invoke the corresponding unshare_*
	 helper routine with flags passed into the system call and a
	 reference to a pointer to the new unshared structure.

       * If any new structures are created by unshare_* helper
	 functions, take the task_lock() on the current task,
	 modify appropriate context pointers, and release the
	 task lock.

       * For all newly unshared structures, release the corresponding
	 older, shared, structures.
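
The flag check in the first step can be written as a small pure
function. The sketch below applies the four rules in the order they are
listed; the function name force_implied_flags() and the signals_shared
parameter, which stands in for the kernel's own test of whether signals
are shared, are assumptions of the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Add the flags implied by those the caller passed. signals_shared
 * says whether the caller's signal state is shared with other tasks.
 */
static unsigned long force_implied_flags(unsigned long flags,
					 int signals_shared)
{
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;
	return flags;
}
```

For instance, a request for CLONE_NEWNS alone comes back as
CLONE_NEWNS | CLONE_FS, so the filesystem information is unshared along
with the namespace.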

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
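
The two kinds of helper can be modelled as follows. This is a sketch
with hypothetical names (struct fs_model, unshare_sighand_model(),
unshare_fs_model()); in particular, skipping the duplication when the
structure has only one user is an assumption of the model.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

struct fs_model { int users; int mask; };	/* hypothetical stand-in */

/* Helper for a context whose unsharing is not implemented. */
static int unshare_sighand_model(unsigned long flags)
{
	return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * Helper for an implemented context: if the flag is set and the
 * structure really is shared, hand back a private duplicate through
 * new_fsp. *new_fsp is left untouched when nothing needs to be done.
 */
static int unshare_fs_model(unsigned long flags, const struct fs_model *cur,
			    struct fs_model **new_fsp)
{
	struct fs_model *new;

	if (!(flags & CLONE_FS) || cur->users <= 1)
		return 0;
	new = malloc(sizeof(*new));
	if (new == NULL)
		return -ENOMEM;
	new->users = 1;
	new->mask = cur->mask;
	*new_fsp = new;
	return 0;
}
```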

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.
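
From user space, the registered architecture-specific number is visible
as SYS_unshare in <sys/syscall.h>. The sketch below, whose wrapper name
raw_unshare() is hypothetical, invokes the system call by number through
syscall(2) rather than through the C library wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke unshare by system call number. Like the library wrapper,
 * syscall() returns -1 and sets errno on failure.
 */
static long raw_unshare(unsigned long flags)
{
	return syscall(SYS_unshare, flags);
}
```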

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.
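
As a sketch of how item 3 might be checked for filesystem information,
the following helper creates a child that shares the umask through
CLONE_FS, lets it unshare and change its umask, and then checks that the
parent's value is intact. The function names, the exit code convention
and the chosen umask values are assumptions for the example; a real LTP
test would use that project's own harness.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)	/* arbitrary size chosen for the sketch */

/* Child: detach from the shared filesystem information, then alter it. */
static int change_umask(void *unused)
{
	(void)unused;
	if (unshare(CLONE_FS) == -1)
		return 2;	/* could not unshare: do not disturb the parent */
	umask(077);
	return 0;
}

/*
 * Returns 1 if the child's umask change stayed private, 0 if it leaked
 * into the parent, and -1 if the test could not be run.
 */
static int fs_unshare_isolates_umask(void)
{
	char *stack = malloc(STACK_SIZE);
	mode_t saved = umask(022);	/* known value for the experiment */
	int status, result = -1;
	pid_t pid;

	if (stack != NULL) {
		pid = clone(change_umask, stack + STACK_SIZE,
			    CLONE_FS | SIGCHLD, NULL);
		if (pid > 0 && waitpid(pid, &status, 0) == pid &&
		    WIFEXITED(status) && WEXITSTATUS(status) == 0)
			result = (umask(022) == 022);
	}
	umask(saved);	/* restore the caller's original umask */
	free(stack);
	return result;
}
```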

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().