.. SPDX-License-Identifier: GPL-2.0

=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================

                Andy Walker <andy@lysaker.kvaerner.no>

                           15 April 1996

                     (Updated September 2007)

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is prey to a number of difficult-to-fix race
conditions which in practice make it not dependable:

        - The write system call checks for a mandatory lock only once
          at its start. It is therefore possible for a lock request to
          be granted after this check but before the data is modified.
          A process may then see file data change even while a mandatory
          lock was held.
        - Similarly, an exclusive lock may be granted on a file after
          the kernel has decided to proceed with a read, but before the
          read has actually completed, and the reading process may see
          the file data in a state which should not have been visible
          to it.
        - Similar races make the claimed mutual exclusion between lock
          and mmap similarly unreliable.

1. What is mandatory locking?
------------------------------

Mandatory locking is kernel-enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine, which is a wrapper around fcntl()). It is
normally a process' responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.

In a perfect world all processes would use and honour a cooperative, or
"advisory" locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.

In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file that another process holds a
"read" -or- "shared" lock on, and block attempts to both read and write to a
file that a process holds a "write" -or- "exclusive" lock on.
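
As an illustration of the cooperative scheme described earlier in this section,
the lock, update, unlock sequence might be sketched in C roughly as follows.
This is only a minimal sketch: ``locked_append`` is a hypothetical helper name
that does not come from this document or from the kernel, and the lock it takes
is an ordinary advisory fcntl() lock unless the file has separately been made a
mandatory locking candidate.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Append `msg` to the file at `path` while holding a whole-file write lock.
 * Returns 0 on success, -1 on error.
 * `locked_append` is a hypothetical name used only for this illustration.
 */
int locked_append(const char *path, const char *msg)
{
	struct flock fl;
	int fd, ret = -1;

	fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;	/* exclusive ("write") lock */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* 0 means "up to end of file": the whole file */

	/* F_SETLKW waits until any conflicting lock has been released. */
	if (fcntl(fd, F_SETLKW, &fl) == 0) {
		size_t len = strlen(msg);

		if (write(fd, msg, len) == (ssize_t)len)
			ret = 0;

		fl.l_type = F_UNLCK;	/* unlock again */
		fcntl(fd, F_SETLK, &fl);
	}

	close(fd);
	return ret;
}
```

A process that never calls fcntl() in this way is not stopped from writing to
the file, which is exactly the weakness of a purely cooperative scheme.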

The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.

.. Note::

   1. In saying "file" in the paragraphs above I am actually not telling
      the whole truth. System V locking is based on fcntl(). The granularity of
      fcntl() is such that it allows the locking of byte ranges in files, in
      addition to entire files, so the mandatory locking rules also have byte
      level granularity.

   2. POSIX.1 does not specify any scheme for mandatory locking, despite
      borrowing the fcntl() locking scheme from System V. The mandatory locking
      scheme is defined by the System V Interface Definition (SVID) Version 3.

2. Marking a file for mandatory locking
---------------------------------------

A file is marked as a candidate for mandatory locking by setting the group-id
bit in its file mode but removing the group-execute bit. This is an otherwise
meaningless combination, and was chosen by the System V implementors so as not
to break existing user programs.

Note that the group-id bit is usually automatically cleared by the kernel when
a setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit. Similarly the kernel has been modified not
to run mandatory lock candidates with setgid privileges.

3. Available implementations
----------------------------

I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.

All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.

HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.

mmap() is another interesting case.
All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.

In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.

SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.

I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.

4. Semantics
------------

1. Mandatory locks can only be applied via the fcntl()/lockf() locking
   interface - in other words the System V/POSIX interface. BSD style
   locks using flock() never result in a mandatory lock.

2. If a process has locked a region of a file with a mandatory read lock, then
   other processes are permitted to read from that region.
   If any of these
   processes attempts to write to the region it will block until the lock is
   released, unless the process has opened the file with the O_NONBLOCK
   flag, in which case the system call will return immediately with the error
   status EAGAIN.

3. If a process has locked a region of a file with a mandatory write lock, all
   attempts to read or write to that region block until the lock is released,
   unless a process has opened the file with the O_NONBLOCK flag, in which case
   the system call will return immediately with the error status EAGAIN.

4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
   any mandatory locks owned by other processes will be rejected with the
   error status EAGAIN.

5. Attempts to apply a mandatory lock to a file that is memory mapped and
   shared (via mmap() with MAP_SHARED) will be rejected with the error status
   EAGAIN.

6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
   that has any mandatory locks in effect will be rejected with the error status
   EAGAIN.

5. Which system calls are affected?
-----------------------------------

Those which modify a file's contents, not just the inode.
That gives read(),
write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.

The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes.)

Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.

6. Warning!
-----------

Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(

7. The "mand" mount option
--------------------------

Mandatory locking is disabled on all filesystems by default, and must be
administratively enabled by mounting with "-o mand".
That mount option
is only allowed if the mounting task has the CAP_SYS_ADMIN capability.

Since kernel v4.5, it is possible to disable mandatory locking
altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
with this disabled will reject attempts to mount filesystems with the
"mand" mount option with the error status EPERM.
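
A program that depends on the "mand" mount option may want to check whether it
is actually in effect for a given path. One possible way, sketched here as an
assumption rather than as the recommended method, is to test the ST_MANDLOCK
flag reported by statvfs(). ``fs_allows_mandatory_locks`` is a hypothetical
helper name, and the fallback value 64 assumes the Linux flag value (the same
as MS_MANDLOCK).

```c
#define _GNU_SOURCE		/* glibc exposes ST_MANDLOCK only with this */
#include <sys/statvfs.h>

/* Assumption: fall back to the Linux value if the libc does not define it. */
#ifndef ST_MANDLOCK
#define ST_MANDLOCK 64
#endif

/*
 * Return 1 if the filesystem containing `path` is mounted with "mand",
 * 0 if it is not, and -1 if statvfs() fails.
 * Hypothetical helper for illustration.
 */
int fs_allows_mandatory_locks(const char *path)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return -1;

	return (sv.f_flag & ST_MANDLOCK) ? 1 : 0;
}
```

A result of 0 means that locks on files under that path remain advisory even
if a file has the setgid-without-group-execute mode.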