.. SPDX-License-Identifier: GPL-2.0

===================================
File management in the Linux kernel
===================================

This document describes how locking works for files (struct file)
and for the file descriptor table (struct files_struct).

Up until 2.6.12, the file descriptor table was protected
with a lock (files->file_lock) and a reference count (files->count).
->file_lock protected accesses to all the file-related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with the CLONE_FILES flag. Typically
this would be the case for POSIX threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
a reference count (->f_count).

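The sharing model described above can be sketched in userspace with C11
atomics. This is only an illustration of the refcounting idea, not kernel
code: struct toy_files and the toy_files_*() helpers are hypothetical names
introduced here, and the real files_struct carries far more state.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a shared descriptor table. */
struct toy_files {
	atomic_int count;	/* number of tasks sharing this table */
};

/* Allocate a table owned by a single task. Returns NULL on failure. */
static struct toy_files *toy_files_alloc(void)
{
	struct toy_files *files = malloc(sizeof(*files));

	if (files)
		atomic_init(&files->count, 1);
	return files;
}

/* A CLONE_FILES-style share: the new task takes one more reference. */
static void toy_files_get(struct toy_files *files)
{
	atomic_fetch_add(&files->count, 1);
}

/*
 * Drop one reference. Returns true if this was the last one, in which
 * case the table has been freed and must not be touched again.
 */
static bool toy_files_put(struct toy_files *files)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&files->count, 1) == 1) {
		free(files);
		return true;
	}
	return false;
}
```

A caller would pair every toy_files_get() with exactly one toy_files_put();
whichever task performs the final put frees the table, mirroring the role
the text gives to put_files_struct().
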
In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements: the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array,
etc. In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are kept in a separate structure, struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of the fdtable, a new fdtable structure is allocated
and files->fdt points to the new structure. The fdtable
structure is freed with RCU, so lock-free readers see either
the old fdtable or the new fdtable, making the update
appear atomic. The locking rules for the fdtable structure
are as follows:

1. All references to the fdtable must be done through
   the files_fdtable() macro::

	struct fdtable *fdt;

	rcu_read_lock();

	fdt = files_fdtable(files);
	....
	if (n < fdt->max_fds)
		....
	...
	rcu_read_unlock();

   files_fdtable() uses the rcu_dereference() macro, which takes care of
   the memory barrier requirements for lock-free dereference.
   The fdtable pointer must be read within the read-side
   critical section.

2. Reading of the fdtable as described above must be protected
   by rcu_read_lock()/rcu_read_unlock().

3. For any update to the fd table, files->file_lock must
   be held.

4. To look up the file structure given an fd, a reader
   must use either the fcheck() or fcheck_files() API. These
   take care of barrier requirements due to lock-free lookup.

   An example::

	struct file *file;

	rcu_read_lock();
	file = fcheck(fd);
	if (file) {
		...
	}
	....
	rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
   of the fd (fget()/fget_light()) is lock-free, it is possible
   that a look-up may race with the last fput() operation on the
   file structure. This is avoided using atomic_long_inc_not_zero()
   on ->f_count::

	rcu_read_lock();
	file = fcheck_files(files, fd);
	if (file) {
		if (atomic_long_inc_not_zero(&file->f_count))
			*fput_needed = 1;
		else
			/* Didn't get the reference, someone's freed */
			file = NULL;
	}
	rcu_read_unlock();
	....
	return file;

   atomic_long_inc_not_zero() detects whether the refcount is already
   zero or goes to zero during the increment. If it does, we fail
   fget()/fget_light().

6. Since both fdtable and file structures can be looked up
   lock-free, they must be installed using the rcu_assign_pointer()
   API. If they are looked up lock-free, rcu_dereference()
   must be used. However, it is advisable to use files_fdtable()
   and fcheck()/fcheck_files(), which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
   holding files->file_lock. If ->file_lock is dropped, then
   another thread may expand the files, thereby creating a new
   fdtable and making the earlier fdtable pointer stale.

   For example::

	spin_lock(&files->file_lock);
	fd = locate_fd(files, file, start);
	if (fd >= 0) {
		/* locate_fd() may have expanded fdtable, load the ptr */
		fdt = files_fdtable(files);
		__set_open_fd(fd, fdt);
		__clear_close_on_exec(fd, fdt);
		spin_unlock(&files->file_lock);
	.....

   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
   the fdtable pointer (fdt) must be loaded after locate_fd().

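The increment-unless-zero step from rule 5 can be approximated in userspace
with a C11 compare-and-swap loop. This is a sketch under assumptions, not the
kernel's implementation: struct toy_file, toy_inc_not_zero() and toy_fget()
are hypothetical names, and RCU is not modelled here. In the kernel, the RCU
read-side critical section is what keeps the file structure's memory valid
while the count is being examined.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file; only the refcount is modelled. */
struct toy_file {
	atomic_long f_count;
};

/*
 * Take a reference unless the count has already reached zero.
 * Built on a compare-and-swap loop: the increment is only committed
 * if the count still holds the non-zero value we observed.
 */
static bool toy_inc_not_zero(atomic_long *count)
{
	long old = atomic_load(count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Try to pin the file found in a slot. Returns NULL if the slot is
 * empty or the final reference has already been dropped.
 */
static struct toy_file *toy_fget(struct toy_file *slot)
{
	if (slot && toy_inc_not_zero(&slot->f_count))
		return slot;
	return NULL;
}
```

A plain unconditional increment would resurrect a file whose count had
already hit zero; the conditional form makes the lookup fail instead, which
is the behaviour rule 5 describes for fget()/fget_light().
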
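
The publish-and-read pattern behind rules 1, 6 and 7 can likewise be sketched
with C11 atomics. The names below (struct toy_fdtable, struct
toy_files_struct, toy_publish(), toy_files_fdtable(), toy_lookup()) are
hypothetical, and release/acquire ordering is used here as a userspace
stand-in for rcu_assign_pointer()/rcu_dereference(). Neither the update lock
nor the deferred freeing of the old table is modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical miniature of struct fdtable: a size plus a slot array. */
struct toy_fdtable {
	unsigned int max_fds;
	void **fd;
};

/* Hypothetical miniature of files_struct: one published table pointer. */
struct toy_files_struct {
	_Atomic(struct toy_fdtable *) fdt;
};

/*
 * Writer side: the new table must be fully initialised before this
 * call. The release store then publishes it, so a reader never sees
 * a half-built table. The caller is assumed to hold the update lock.
 */
static void toy_publish(struct toy_files_struct *files,
			struct toy_fdtable *new_fdt)
{
	atomic_store_explicit(&files->fdt, new_fdt, memory_order_release);
}

/* Reader side: one acquire load yields either the old or the new table. */
static struct toy_fdtable *toy_files_fdtable(struct toy_files_struct *files)
{
	return atomic_load_explicit(&files->fdt, memory_order_acquire);
}

/* Bounds-checked lookup against whichever table the reader observed. */
static void *toy_lookup(struct toy_files_struct *files, unsigned int fd)
{
	struct toy_fdtable *fdt = toy_files_fdtable(files);

	if (fdt && fd < fdt->max_fds)
		return fdt->fd[fd];
	return NULL;
}
```

Because the size and the slot array live in the same structure and are
reached through a single pointer load, a reader can never combine the old
size with the new array or vice versa, which is why the text keeps all
table elements inside struct fdtable.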