unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
        1) Overview
        2) Benefits
        3) Cost
        4) Requirements
        5) Functional Specification
        6) High Level Design
        7) Low Level Design
        8) Test Specification
        9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel.
The power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible.
Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile; however, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
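The use-case above can be illustrated with a minimal user-space sketch.
It is not taken from any PAM module; the helper names
enter_private_namespace() and polyinstantiate_tmp() and the
instance_dir parameter are hypothetical. The sketch assumes a glibc
system that exposes unshare() and mount(), and that the caller has
CAP_SYS_ADMIN::

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Hypothetical helper (not part of any PAM API): give the calling
 * process its own copy of the namespace.
 * Returns 0 on success or a negative errno value on failure.
 */
static int enter_private_namespace(void)
{
	if (unshare(CLONE_NEWNS) == -1)
		return -errno;	/* e.g. -EPERM without CAP_SYS_ADMIN */

	/*
	 * Mark every mount private so that later mount changes do not
	 * propagate back to the namespace the caller just left.
	 */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
		return -errno;

	return 0;
}

/*
 * Hypothetical helper: bind-mount a per-user instance directory over
 * /tmp. Only the caller and its future children see the new /tmp.
 */
static int polyinstantiate_tmp(const char *instance_dir)
{
	int rc = enter_private_namespace();

	if (rc)
		return rc;
	if (mount(instance_dir, "/tmp", NULL, MS_BIND, NULL) == -1)
		return -errno;
	return 0;
}
```

A login-time caller would invoke polyinstantiate_tmp() with an existing
per-user directory before starting the user's session. A negative
return value means that any steps which did not complete were not
applied, so the caller should treat it as a failed login setup.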

2.2 unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system call.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have an interface similar to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as needed basis.

5) Functional Specification
---------------------------

NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare() allows a process to disassociate parts of its execution
        context that are currently being shared with other processes.
        Part of the execution context, such as the namespace, is shared by
        default when a new process is created using fork(2), while other
        parts, such as the virtual memory, open file descriptors, etc, may
        be shared by explicit request to share them when creating a process
        using clone(2).

        The main use of unshare() is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one of the following constants, or
        the bitwise OR of several of them.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned.
        On failure, -1 is returned and errno is set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.

SEE ALSO
        clone(2), fork(2)

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() because of the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task.
     Therefore unshare() has to take appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     a new namespace structure, the error return code will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed.
The unshare() system call, on
the other hand, performs the following:

  1) Check flags to force missing, but implied, flags.

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.

7) Low Level Design
-------------------

Implementation of unshare() can be grouped in the following 4 different
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       * Check flags.
         Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
         If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
         set and signals are also being shared, force CLONE_THREAD. If
         CLONE_NEWNS is set, force CLONE_FS.

       * For each context flag, invoke the corresponding unshare_*
         helper routine with the flags passed into the system call and a
         reference to a pointer to the new unshared structure.

       * If any new structures are created by unshare_* helper
         functions, take the task_lock() on the current task,
         modify appropriate context pointers, and release the
         task lock.

       * For all newly unshared structures, release the corresponding
         older, shared, structures.

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().