.. SPDX-License-Identifier: GPL-2.0

===================================
File management in the Linux kernel
===================================

This document describes how locking for files (struct file)
and the file descriptor table (struct files_struct) works.

Up until 2.6.12, the file descriptor table was protected
with a lock (files->file_lock) and a reference count (files->count).
->file_lock protected accesses to all the file-related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with the CLONE_FILES flag. Typically
this would be the case for POSIX threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
a reference count (->f_count).

In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements: the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array,
etc. In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are kept in a separate structure, struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of the fdtable, a new fdtable structure is allocated
and files->fdt points to the new structure. The old fdtable
structure is freed with RCU, and lock-free readers see either
the old fdtable or the new fdtable, making the update
appear atomic. Here are the locking rules for
the fdtable structure:

1. All references to the fdtable must be done through
   the files_fdtable() macro::

        struct fdtable *fdt;

        rcu_read_lock();

        fdt = files_fdtable(files);
        ....
        if (n <= fdt->max_fds)
                ....
        ...
        rcu_read_unlock();

   files_fdtable() uses the rcu_dereference() macro, which takes care of
   the memory barrier requirements for lock-free dereference.
   The fdtable pointer must be read within the read-side
   critical section.

2. Reading of the fdtable as described above must be protected
   by rcu_read_lock()/rcu_read_unlock().

3. For any update to the fd table, files->file_lock must
   be held.

4. To look up the file structure given an fd, a reader
   must use either the fcheck() or the fcheck_files() API. These
   take care of the barrier requirements due to lock-free lookup.

   An example::

        struct file *file;

        rcu_read_lock();
        file = fcheck(fd);
        if (file) {
                ...
        }
        ....
        rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
   of the fd (fget()/fget_light()) is lock-free, it is possible
   that a look-up may race with the last put() operation on the
   file structure. This is avoided using atomic_long_inc_not_zero()
   on ->f_count::

        rcu_read_lock();
        file = fcheck_files(files, fd);
        if (file) {
                if (atomic_long_inc_not_zero(&file->f_count))
                        *fput_needed = 1;
                else
                        /* Didn't get the reference, someone's freed */
                        file = NULL;
        }
        rcu_read_unlock();
        ....
        return file;

   atomic_long_inc_not_zero() detects if the refcount is already zero or
   drops to zero during the increment. If it does, we fail
   fget()/fget_light().

6. Since both fdtable and file structures can be looked up
   lock-free, they must be installed using the rcu_assign_pointer()
   API. If they are looked up lock-free, rcu_dereference()
   must be used. However, it is advisable to use files_fdtable()
   and fcheck()/fcheck_files(), which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
   holding files->file_lock. If ->file_lock is dropped, then
   another thread may expand the fd table, thereby creating a new
   fdtable and making the earlier fdtable pointer stale.

   For example::

        spin_lock(&files->file_lock);
        fd = locate_fd(files, file, start);
        if (fd >= 0) {
                /* locate_fd() may have expanded fdtable, load the ptr */
                fdt = files_fdtable(files);
                __set_open_fd(fd, fdt);
                __clear_close_on_exec(fd, fdt);
                spin_unlock(&files->file_lock);
        .....

   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
   the fdtable pointer (fdt) must be loaded after locate_fd().
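
The increment-unless-zero check from rule 5 can be illustrated outside the
kernel. What follows is a minimal user-space sketch, assuming C11 atomics in
place of the kernel's atomic_long_t. The names demo_file, demo_inc_not_zero(),
demo_get() and demo_put() are hypothetical and are not kernel API. The sketch
only models the reference count and does not model RCU or the fd table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct demo_file {
	atomic_long f_count;
};

/*
 * Increment *v unless it is zero; return true if the increment happened.
 * This follows the contract of atomic_long_inc_not_zero(), written here
 * as a C11 compare-and-swap loop (an assumption, not the kernel's code).
 */
static bool demo_inc_not_zero(atomic_long *v)
{
	long old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Lookup side: take a reference only if the object is still live. */
static struct demo_file *demo_get(struct demo_file *f)
{
	if (f && demo_inc_not_zero(&f->f_count))
		return f;
	return NULL;	/* count already reached zero: treat as not found */
}

/* Drop one reference; return true if it was the last one. */
static bool demo_put(struct demo_file *f)
{
	long old = atomic_fetch_sub(&f->f_count, 1);

	assert(old > 0);
	return old == 1;
}
```

A caller that gets NULL back from demo_get() must behave as if the descriptor
was never found, which is the same outcome as a failed fget()/fget_light() in
rule 5. Once the count has reached zero it is never raised again, so a
late lookup cannot revive an object that is being freed.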