	JFFS2 LOCKING DOCUMENTATION
	---------------------------

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


	alloc_sem
	---------

The alloc_sem is a per-filesystem mutex, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocation (jffs2_reserve_space()) and released
upon write completion (jffs2_complete_reservation()). Note that the
garbage collector obtains it right at the beginning of
jffs2_garbage_collect_pass() and releases it at the end, thereby
preventing any other write activity on the file system during a
garbage-collection pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures of the inode to
which they belong. This is for the benefit of NAND flash: adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at that
moment is part of the new node, not something that was written
afterwards. Hence, we can ensure that the newly-obsoleted nodes don't
actually get erased until the write-buffer has been flushed to the
medium.
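
To make the rule concrete, here is a minimal userspace sketch of the
write path it implies. It is not kernel code: pthread mutexes stand in
for alloc_sem and f->sem, and all model_* types and functions are
hypothetical, only loosely modelled on jffs2_reserve_space() and
jffs2_complete_reservation(). What matters is the ordering: alloc_sem
is taken by the reservation, stays held while the node is linked into
the inode under f->sem, and is released only afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-filesystem state (not the real jffs2_sb_info). */
struct model_sb {
	pthread_mutex_t alloc_sem; /* stands in for c->alloc_sem */
	size_t next_free;          /* next free offset on the "medium" */
};

/* Hypothetical per-inode state (not the real jffs2_inode_info). */
struct model_inode {
	pthread_mutex_t sem;       /* stands in for f->sem */
	size_t last_node_ofs;      /* offset of the newest linked node */
	int nr_nodes;
};

/* Modelled on jffs2_reserve_space(): returns with alloc_sem held. */
static size_t model_reserve_space(struct model_sb *c, size_t len)
{
	size_t ofs;

	pthread_mutex_lock(&c->alloc_sem);
	ofs = c->next_free;
	c->next_free += len;
	return ofs;
}

/* Modelled on jffs2_complete_reservation(): releases alloc_sem. */
static void model_complete_reservation(struct model_sb *c)
{
	pthread_mutex_unlock(&c->alloc_sem);
}

/* Write one node; alloc_sem is held until the node has been linked. */
static size_t model_write_node(struct model_sb *c, struct model_inode *f,
			       size_t len)
{
	size_t ofs = model_reserve_space(c, len); /* takes alloc_sem */

	pthread_mutex_lock(&f->sem); /* f->sem only after allocation */
	/* ...the node would be written to the medium at 'ofs' here... */
	f->last_node_ofs = ofs;      /* link it into the inode */
	f->nr_nodes++;
	pthread_mutex_unlock(&f->sem);

	model_complete_reservation(c); /* only now drop alloc_sem */
	return ofs;
}
```

Note that the reservation's lock outlives the f->sem critical section;
that is the "hold alloc_sem until linked" rule expressed in code.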

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Reading the wbuf_len member atomically, to see
whether the wbuf currently holds any data, is permitted without it.
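
A small sketch of that exception, again in userspace C with
hypothetical names: updates to the buffer happen under a mutex standing
in for alloc_sem, while a C11 atomic load lets a caller ask "is
anything buffered?" without taking it. The answer is only a hint and
may be stale by the time it is acted upon.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the wbuf-related fields; names are illustrative. */
struct model_wbuf_sb {
	pthread_mutex_t alloc_sem; /* protects wbuf[] and wbuf_len updates */
	unsigned char wbuf[512];
	atomic_uint wbuf_len;      /* may be peeked at without alloc_sem */
};

/* Lockless check: only answers "is anything buffered right now?". */
static bool model_wbuf_has_data(struct model_wbuf_sb *c)
{
	return atomic_load(&c->wbuf_len) != 0;
}

/* Any modification of the buffer requires alloc_sem. */
static void model_wbuf_append(struct model_wbuf_sb *c, const void *buf,
			      unsigned int len)
{
	pthread_mutex_lock(&c->alloc_sem);
	unsigned int cur = atomic_load(&c->wbuf_len);
	assert(cur + len <= sizeof(c->wbuf));
	memcpy(c->wbuf + cur, buf, len);
	atomic_store(&c->wbuf_len, cur + len);
	pthread_mutex_unlock(&c->alloc_sem);
}
```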

Ordering constraints: See f->sem.


	File Mutex f->sem
	-----------------

This is the JFFS2-internal equivalent of the inode mutex i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection: the VFS will lock the i_sem
before calling a function which may need to allocate space. The
allocation may trigger garbage collection, which may need to move a
node belonging to the inode which was locked in the first place by the
VFS. If the garbage collection code were to attempt to lock the i_sem
of the inode from which it's garbage-collecting a physical node, this
would lead to deadlock, unless we played games with unlocking the i_sem
before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
mutex, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

	1. Never attempt to allocate space or lock alloc_sem with
	   any f->sem held.
	2. Never attempt to lock two file mutexes in one thread.
	   No ordering rules have been made for doing so.
	3. Never lock a page cache page with f->sem held.
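
The first two rules can be read as a ranking: alloc_sem before f->sem,
and at most one f->sem at a time. Below is a toy, lockdep-like checker
that a userspace model could use to assert them. The lock classes and
functions are invented for illustration (the kernel's real tool for
this is lockdep), rule 3 is not modelled, and it assumes locks are
released in LIFO order. It also ranks wbuf_sem last, as the wbuf_sem
section requires.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, ranked in the order they may be taken. */
enum lock_class {
	LC_ALLOC_SEM = 1, /* c->alloc_sem: taken first */
	LC_F_SEM = 2,     /* f->sem: only after space is allocated */
	LC_WBUF_SEM = 3,  /* c->wbuf_sem: taken last */
};

#define MAX_HELD 8

/* Locks held by one thread, innermost last (assumes LIFO release). */
struct held_locks {
	int depth;
	enum lock_class stack[MAX_HELD];
};

/* True if taking a lock of class 'lc' now respects the rules. */
static bool may_acquire(const struct held_locks *h, enum lock_class lc)
{
	for (int i = 0; i < h->depth; i++) {
		/* Refuse if anything of equal or higher rank is held: this
		 * covers rule 1 (no alloc_sem under f->sem) and rule 2 (no
		 * second f->sem), and also refuses alloc_sem twice. */
		if (h->stack[i] >= lc)
			return false;
	}
	return true;
}

static bool acquire(struct held_locks *h, enum lock_class lc)
{
	if (h->depth == MAX_HELD || !may_acquire(h, lc))
		return false;
	h->stack[h->depth++] = lc;
	return true;
}

static void release(struct held_locks *h)
{
	assert(h->depth > 0);
	h->depth--;
}
```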


	erase_completion_lock spinlock
	------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case (see below).

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file mutex f->sem. But the erase code
may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.
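
A sketch of such a walk, as userspace C with hypothetical names: a
pthread mutex stands in for the erase_completion_lock spinlock, and
bit 0 of flash_offset marks an obsolete node as described above. The
walker skips obsolete entries while still holding the lock, and only
releases it while its cursor points at a valid node, which the caller's
f->sem (assumed held, not modelled here) keeps in place.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-inode physical node list entry. */
struct model_ref {
	uint32_t flash_offset; /* bit 0 set means "obsolete" */
	struct model_ref *next;
};

static int ref_obsolete(const struct model_ref *r)
{
	return (r->flash_offset & 1) != 0;
}

/* Placeholder for work that may sleep (e.g. a flash read), so it must
 * not run with the spinlock held. Here it just counts valid nodes. */
static void model_sleepy_work(const struct model_ref *r, unsigned *count)
{
	(void)r;
	(*count)++;
}

/* Walk the list under the lock, dropping it only at valid nodes. */
static unsigned model_walk_nodes(pthread_mutex_t *erase_completion_lock,
				 struct model_ref *head)
{
	unsigned valid = 0;

	pthread_mutex_lock(erase_completion_lock);
	for (struct model_ref *r = head; r; r = r->next) {
		if (ref_obsolete(r))
			continue; /* never drop the lock while at this node */
		pthread_mutex_unlock(erase_completion_lock);
		model_sleepy_work(r, &valid); /* 'r' is kept alive by f->sem */
		pthread_mutex_lock(erase_completion_lock);
		/* r->next is re-read under the lock by the loop step, since
		 * obsolete successors may have been unlinked meanwhile. */
	}
	pthread_mutex_unlock(erase_completion_lock);
	return valid;
}
```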

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code that kills
the GC thread takes the lock, sends the signal, then releases it; the
GC thread itself takes the lock, zeroes c->gc_task, then releases it on
its exit path.
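
That handshake might be modelled as follows. This is a userspace
sketch with hypothetical names: gc_task is a plain pointer, "sending
the signal" just sets a flag, and a pthread mutex stands in for the
spinlock. Because both sides touch c->gc_task only under the lock, the
killer either sees the live task and signals it, or sees NULL and
leaves it alone.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the GC thread's task structure. */
struct model_task {
	bool kill_requested;
};

struct model_gc_sb {
	pthread_mutex_t erase_completion_lock;
	struct model_task *gc_task; /* NULL once the GC thread is exiting */
};

/* Killer side: lock, signal the task if it is still there, unlock. */
static bool model_stop_gc(struct model_gc_sb *c)
{
	bool signalled = false;

	pthread_mutex_lock(&c->erase_completion_lock);
	if (c->gc_task) {
		c->gc_task->kill_requested = true; /* "send the signal" */
		signalled = true;
	}
	pthread_mutex_unlock(&c->erase_completion_lock);
	return signalled;
}

/* GC thread exit path: lock, clear gc_task, unlock. */
static void model_gc_exit(struct model_gc_sb *c)
{
	pthread_mutex_lock(&c->erase_completion_lock);
	c->gc_task = NULL;
	pthread_mutex_unlock(&c->erase_completion_lock);
}
```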


	inocache_lock spinlock
	----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has a
corresponding jffs2_inode_cache object). So the inocache_lock
has to be held while walking the c->inocache_list hash buckets.

This spinlock also covers allocation of new inode numbers, which is
currently just '++c->highest_ino', but might one day get more complicated
if we need to deal with wrapping after about 4 billion (2^32) inode
numbers have been used.
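
The current allocation scheme amounts to something like the sketch
below (userspace C, hypothetical names, a pthread mutex standing in for
the spinlock). The check that returns 0 once the counter is exhausted
is purely illustrative: the text describes wrap-around handling only as
possible future work.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: only the fields needed for inode numbering. */
struct model_ino_sb {
	pthread_mutex_t inocache_lock; /* stands in for the spinlock */
	uint32_t highest_ino;
};

/* Hand out the next inode number under inocache_lock.
 * Returns 0 (illustrative "no number left") if the counter is full. */
static uint32_t model_new_ino(struct model_ino_sb *c)
{
	uint32_t ino = 0;

	pthread_mutex_lock(&c->inocache_lock);
	if (c->highest_ino != UINT32_MAX)
		ino = ++c->highest_ino;
	pthread_mutex_unlock(&c->inocache_lock);
	return ino;
}
```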

Note that holding f->sem guarantees that the corresponding
jffs2_inode_cache will not be removed, so it may be accessed without
taking the inocache_lock spinlock.

Ordering constraints:

	If both erase_completion_lock and inocache_lock are needed, the
	c->erase_completion_lock has to be acquired first.
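
A sketch of code that needs both locks, in userspace C with
hypothetical names and pthread mutexes standing in for the two
spinlocks. Every path takes erase_completion_lock first and
inocache_lock second, so two threads running it concurrently cannot
deadlock on each other; a path taking them in the opposite order could.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model holding both locks and some state each guards. */
struct model_two_locks {
	pthread_mutex_t erase_completion_lock;
	pthread_mutex_t inocache_lock;
	int ic_state;    /* e.g. a jffs2_inode_cache field */
	int block_state; /* e.g. an eraseblock list field */
};

/* Outer lock first, inner lock second, released in reverse order. */
static void model_update_both(struct model_two_locks *c, int ic, int blk)
{
	pthread_mutex_lock(&c->erase_completion_lock); /* outer */
	pthread_mutex_lock(&c->inocache_lock);         /* inner */
	c->block_state = blk;
	c->ic_state = ic;
	pthread_mutex_unlock(&c->inocache_lock);
	pthread_mutex_unlock(&c->erase_completion_lock);
}

/* Thread body used to exercise the ordering from two threads. */
static void *model_worker(void *arg)
{
	struct model_two_locks *c = arg;

	for (int i = 0; i < 10000; i++)
		model_update_both(c, i, i);
	return NULL;
}
```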


	erase_free_sem
	--------------

This mutex is used only by the erase code that frees obsolete node
references, and by the jffs2_garbage_collect_deletion_dirent() function.
On NAND flash, the latter function must read _obsolete_ nodes to
determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked. Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.
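
A userspace sketch of the current arrangement, with hypothetical
names: a pthread mutex stands in for erase_free_sem, the
garbage-collection side holds it across a (simulated) flash read of an
obsolete node, and the erase side frees the node reference only while
holding it, so the reference cannot vanish mid-read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical obsolete node reference awaiting freeing. */
struct model_obs_ref {
	unsigned int flash_offset;
};

struct model_erase_sb {
	pthread_mutex_t erase_free_sem; /* a sleeping lock, unlike a spinlock */
	struct model_obs_ref *obsolete_ref;
};

/* GC side: may sleep while reading flash, so it holds erase_free_sem
 * (not erase_completion_lock) to keep the ref from being freed. */
static bool model_gc_check_obsolete(struct model_erase_sb *c,
				    unsigned int *ofs)
{
	bool found = false;

	pthread_mutex_lock(&c->erase_free_sem);
	if (c->obsolete_ref) {
		/* ...a flash read of the obsolete node could sleep here... */
		*ofs = c->obsolete_ref->flash_offset;
		found = true;
	}
	pthread_mutex_unlock(&c->erase_free_sem);
	return found;
}

/* Erase side: frees obsolete refs only while holding erase_free_sem. */
static void model_erase_free_refs(struct model_erase_sb *c)
{
	pthread_mutex_lock(&c->erase_free_sem);
	free(c->obsolete_ref);
	c->obsolete_ref = NULL;
	pthread_mutex_unlock(&c->erase_free_sem);
}
```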


	wbuf_sem
	--------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
	Lock wbuf_sem last, after the alloc_sem and/or f->sem.


	c->xattr_sem
	------------

This read/write semaphore protects against concurrent access to the
xattr-related objects, which include fields in the superblock and
ic->xref. In read-only paths the write semaphore is more exclusion than
is needed; the read semaphore is enough. But you must hold the write
semaphore when updating, creating or deleting any xattr-related object.

Once xattr_sem is released, there is no guarantee that those objects
still exist. Thus a sequence of operations often has to be retried
when an object turns out to need updating while only the read
semaphore is held. For example, do_jffs2_getxattr() first holds the
read semaphore to scan the xref and xdatum. If it then needs to load a
name/value pair from the medium, it releases the read semaphore and
retries the whole scan while holding the write semaphore.

Ordering constraints:
	Lock xattr_sem last, after the alloc_sem.