.. SPDX-License-Identifier: GPL-2.0

=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================

		Andy Walker <andy@lysaker.kvaerner.no>

			   15 April 1996

		     (Updated September 2007)

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is prey to a number of difficult-to-fix race
conditions which in practice make it not dependable:

	- The write system call checks for a mandatory lock only once
	  at its start.  It is therefore possible for a lock request to
	  be granted after this check but before the data is modified.
	  A process may then see file data change even while a mandatory
	  lock was held.
	- Similarly, an exclusive lock may be granted on a file after
	  the kernel has decided to proceed with a read, but before the
	  read has actually completed, and the reading process may see
	  the file data in a state which should not have been visible
	  to it.
	- Similar races make the claimed mutual exclusion between lock
	  and mmap similarly unreliable.

1. What is mandatory locking?
-----------------------------

Mandatory locking is kernel-enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine, which is a wrapper around fcntl()). It is
normally a process's responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.
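
The cooperative lock, update, unlock sequence described above might look
roughly like this in C. This is only a sketch: the helper name locked_append()
is made up for illustration, and it assumes the caller has already opened the
file for writing and that every other program touching the file follows the
same protocol.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to fd while holding an
 * exclusive advisory lock on the whole file.
 * Returns 0 on success, -1 on error or on a short write.
 */
int locked_append(int fd, const void *buf, size_t len)
{
	struct flock fl = {
		.l_type = F_WRLCK,	/* exclusive ("write") lock */
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,		/* 0 means "to end of file" */
	};
	int ret = -1;

	/* Sleep until no other process holds a conflicting lock. */
	if (fcntl(fd, F_SETLKW, &fl) == -1)
		return -1;

	if (lseek(fd, 0, SEEK_END) != (off_t)-1 &&
	    write(fd, buf, len) == (ssize_t)len)
		ret = 0;

	/* Release the lock whether or not the update succeeded. */
	fl.l_type = F_UNLCK;
	if (fcntl(fd, F_SETLK, &fl) == -1)
		ret = -1;

	return ret;
}
```

Nothing stops a process that skips the fcntl() call from writing to the file
anyway, which is exactly the weakness of a purely advisory scheme.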

In a perfect world all processes would use and honour a cooperative, or
"advisory" locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.

In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file that another process holds a
"read" or "shared" lock on, and block attempts to both read and write to a
file that a process holds a "write" or "exclusive" lock on.

The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.

.. Note::

   1. In saying "file" in the paragraphs above I am actually not telling
      the whole truth. System V locking is based on fcntl(). The granularity of
      fcntl() is such that it allows the locking of byte ranges in files, in
      addition to entire files, so the mandatory locking rules also have byte
      level granularity.

   2. POSIX.1 does not specify any scheme for mandatory locking, despite
      borrowing the fcntl() locking scheme from System V. The mandatory locking
      scheme is defined by the System V Interface Definition (SVID) Version 3.
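
To illustrate the byte-range granularity mentioned in the note, a range lock
can be requested by filling in the start and length fields of struct flock.
The sketch below uses a hypothetical helper, lock_range(); whether the
resulting lock is advisory or mandatory depends only on how the file and its
filesystem are set up, not on this call.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to lock len bytes starting at offset start.
 * type is F_RDLCK (shared), F_WRLCK (exclusive) or F_UNLCK (release).
 * A len of 0 extends the range to the end of the file, however large
 * the file grows.
 * Returns 0 on success, -1 if the request failed.
 */
int lock_range(int fd, short type, off_t start, off_t len)
{
	struct flock fl = {
		.l_type = type,
		.l_whence = SEEK_SET,
		.l_start = start,
		.l_len = len,
	};

	/* F_SETLK does not wait: a conflict fails with EACCES or EAGAIN. */
	return fcntl(fd, F_SETLK, &fl);
}
```

For example, lock_range(fd, F_WRLCK, 0, 10) covers only the first ten bytes,
leaving the rest of the file free for other processes to lock.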

2. Marking a file for mandatory locking
---------------------------------------

A file is marked as a candidate for mandatory locking by setting the group-id
bit in its file mode but removing the group-execute bit. This is an otherwise
meaningless combination, and was chosen by the System V implementors so as not
to break existing user programs.

Note that the group-id bit is usually automatically cleared by the kernel when
a setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit. Similarly the kernel has been modified not
to run mandatory lock candidates with setgid privileges.
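
As a rough illustration of the mode-bit combination described above, the
following sketch tests for it and sets it. Both function names are invented
for this example and are not part of any kernel or libc interface.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return 1 if mode has the setgid bit set and the group-execute bit
 * clear, i.e. the combination that marks a mandatory lock candidate.
 */
int is_mandatory_candidate(mode_t mode)
{
	return (mode & S_ISGID) && !(mode & S_IXGRP);
}

/*
 * Hypothetical helper: mark an existing file as a candidate by turning
 * setgid on and group-execute off, keeping all other mode bits.
 * Returns 0 on success, -1 on error.
 */
int mark_mandatory_candidate(const char *path)
{
	struct stat st;

	if (stat(path, &st) == -1)
		return -1;

	return chmod(path, ((st.st_mode & 07777) | S_ISGID) & ~S_IXGRP);
}
```

From a shell the same change is usually made with "chmod g+s,g-x file".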

3. Available implementations
----------------------------

I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.

All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.

HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.

mmap() is another interesting case. All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.

In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.

SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.

I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.

4. Semantics
------------

1. Mandatory locks can only be applied via the fcntl()/lockf() locking
   interface - in other words the System V/POSIX interface. BSD style
   locks using flock() never result in a mandatory lock.

2. If a process has locked a region of a file with a mandatory read lock, then
   other processes are permitted to read from that region. If any of these
   processes attempts to write to the region it will block until the lock is
   released, unless the process has opened the file with the O_NONBLOCK
   flag in which case the system call will return immediately with the error
   status EAGAIN.

3. If a process has locked a region of a file with a mandatory write lock, all
   attempts to read or write to that region block until the lock is released,
   unless a process has opened the file with the O_NONBLOCK flag in which case
   the system call will return immediately with the error status EAGAIN.

4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
   any mandatory locks owned by other processes will be rejected with the
   error status EAGAIN.

5. Attempts to apply a mandatory lock to a file that is memory mapped and
   shared (via mmap() with MAP_SHARED) will be rejected with the error status
   EAGAIN.

6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
   that has any mandatory locks in effect will be rejected with the error status
   EAGAIN.

5. Which system calls are affected?
-----------------------------------

Those which access or modify a file's contents, not just the inode. That gives
read(), write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.

The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes.)
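
The region arithmetic described above can be modelled with two small
functions. This is a simplified sketch of the rules as stated in the text, not
the kernel's actual checking code; the function names are invented, and a
"whole file" lock (l_len of 0) is deliberately left out of the overlap test.

```c
#include <sys/types.h>

/*
 * Return 1 if the half-open byte range [a_start, a_start + a_len)
 * overlaps [b_start, b_start + b_len), else 0. Lengths must be
 * positive; a "to end of file" lock (l_len == 0) is not modelled.
 */
int ranges_overlap(off_t a_start, off_t a_len, off_t b_start, off_t b_len)
{
	return a_start < b_start + b_len && b_start < a_start + a_len;
}

/*
 * Region touched by a truncate from old_size to new_size: the bytes
 * removed when shrinking, or the bytes added when growing.
 */
void truncate_region(off_t old_size, off_t new_size,
		     off_t *start, off_t *len)
{
	if (new_size < old_size) {
		*start = new_size;
		*len = old_size - new_size;
	} else {
		*start = old_size;
		*len = new_size - old_size;
	}
}
```

A read of 10 bytes at offset 0 therefore conflicts with a write lock on bytes
5 to 14, but not with one that starts at offset 10.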

Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.

6. Warning!
-----------

Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(

7. The "mand" mount option
--------------------------

Mandatory locking is disabled on all filesystems by default, and must be
administratively enabled by mounting with "-o mand". That mount option
is only allowed if the mounting task has the CAP_SYS_ADMIN capability.

Since kernel v4.5, it is possible to disable mandatory locking
altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
with this disabled will reject attempts to mount filesystems with the
"mand" mount option with the error status EPERM.
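
One way a program could check whether a given mount point was mounted with
"mand" is to look it up in the mount table. The sketch below assumes that
/proc is mounted and that the kernel lists the option as "mand" in
/proc/self/mounts; mount_has_mand() is a hypothetical helper name.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: report whether the filesystem mounted at
 * mountpoint carries the "mand" option.
 * Returns 1 if it does, 0 if it does not, and -1 if the mount table
 * cannot be read or lists no such mount point.
 */
int mount_has_mand(const char *mountpoint)
{
	FILE *tab = setmntent("/proc/self/mounts", "r");
	struct mntent *ent;
	int ret = -1;

	if (!tab)
		return -1;

	while ((ent = getmntent(tab)) != NULL) {
		/*
		 * Keep scanning: a later entry for the same directory is
		 * mounted on top of, and so hides, an earlier one.
		 */
		if (strcmp(ent->mnt_dir, mountpoint) == 0)
			ret = hasmntopt(ent, "mand") ? 1 : 0;
	}

	endmntent(tab);
	return ret;
}
```

The argument must be the mount point itself, not an arbitrary path below it.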