162306a36Sopenharmony_ci.. _kernel_hacking_lock: 262306a36Sopenharmony_ci 362306a36Sopenharmony_ci=========================== 462306a36Sopenharmony_ciUnreliable Guide To Locking 562306a36Sopenharmony_ci=========================== 662306a36Sopenharmony_ci 762306a36Sopenharmony_ci:Author: Rusty Russell 862306a36Sopenharmony_ci 962306a36Sopenharmony_ciIntroduction 1062306a36Sopenharmony_ci============ 1162306a36Sopenharmony_ci 1262306a36Sopenharmony_ciWelcome, to Rusty's Remarkably Unreliable Guide to Kernel Locking 1362306a36Sopenharmony_ciissues. This document describes the locking systems in the Linux Kernel 1462306a36Sopenharmony_ciin 2.6. 1562306a36Sopenharmony_ci 1662306a36Sopenharmony_ciWith the wide availability of HyperThreading, and preemption in the 1762306a36Sopenharmony_ciLinux Kernel, everyone hacking on the kernel needs to know the 1862306a36Sopenharmony_cifundamentals of concurrency and locking for SMP. 1962306a36Sopenharmony_ci 2062306a36Sopenharmony_ciThe Problem With Concurrency 2162306a36Sopenharmony_ci============================ 2262306a36Sopenharmony_ci 2362306a36Sopenharmony_ci(Skip this if you know what a Race Condition is). 2462306a36Sopenharmony_ci 2562306a36Sopenharmony_ciIn a normal program, you can increment a counter like so: 2662306a36Sopenharmony_ci 2762306a36Sopenharmony_ci:: 2862306a36Sopenharmony_ci 2962306a36Sopenharmony_ci very_important_count++; 3062306a36Sopenharmony_ci 3162306a36Sopenharmony_ci 3262306a36Sopenharmony_ciThis is what they would expect to happen: 3362306a36Sopenharmony_ci 3462306a36Sopenharmony_ci 3562306a36Sopenharmony_ci.. table:: Expected Results 3662306a36Sopenharmony_ci 3762306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 3862306a36Sopenharmony_ci | Instance 1 | Instance 2 | 3962306a36Sopenharmony_ci +====================================+====================================+ 4062306a36Sopenharmony_ci | read very_important_count (5) | | 4162306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 4262306a36Sopenharmony_ci | add 1 (6) | | 4362306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 4462306a36Sopenharmony_ci | write very_important_count (6) | | 4562306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 4662306a36Sopenharmony_ci | | read very_important_count (6) | 4762306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 4862306a36Sopenharmony_ci | | add 1 (7) | 4962306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 5062306a36Sopenharmony_ci | | write very_important_count (7) | 5162306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 5262306a36Sopenharmony_ci 5362306a36Sopenharmony_ciThis is what might happen: 5462306a36Sopenharmony_ci 5562306a36Sopenharmony_ci.. table:: Possible Results 5662306a36Sopenharmony_ci 5762306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 5862306a36Sopenharmony_ci | Instance 1 | Instance 2 | 5962306a36Sopenharmony_ci +====================================+====================================+ 6062306a36Sopenharmony_ci | read very_important_count (5) | | 6162306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 6262306a36Sopenharmony_ci | | read very_important_count (5) | 6362306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 6462306a36Sopenharmony_ci | add 1 (6) | | 6562306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 6662306a36Sopenharmony_ci | | add 1 (6) | 6762306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 6862306a36Sopenharmony_ci | write very_important_count (6) | | 6962306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 7062306a36Sopenharmony_ci | | write very_important_count (6) | 7162306a36Sopenharmony_ci +------------------------------------+------------------------------------+ 7262306a36Sopenharmony_ci 7362306a36Sopenharmony_ci 7462306a36Sopenharmony_ciRace Conditions and Critical Regions 7562306a36Sopenharmony_ci------------------------------------ 7662306a36Sopenharmony_ci 7762306a36Sopenharmony_ciThis overlap, where the result depends on the relative timing of 7862306a36Sopenharmony_cimultiple tasks, is called a race condition. The piece of code containing 7962306a36Sopenharmony_cithe concurrency issue is called a critical region. And especially since 8062306a36Sopenharmony_ciLinux starting running on SMP machines, they became one of the major 8162306a36Sopenharmony_ciissues in kernel design and implementation. 8262306a36Sopenharmony_ci 8362306a36Sopenharmony_ciPreemption can have the same effect, even if there is only one CPU: by 8462306a36Sopenharmony_cipreempting one task during the critical region, we have exactly the same 8562306a36Sopenharmony_cirace condition. In this case the thread which preempts might run the 8662306a36Sopenharmony_cicritical region itself. 8762306a36Sopenharmony_ci 8862306a36Sopenharmony_ciThe solution is to recognize when these simultaneous accesses occur, and 8962306a36Sopenharmony_ciuse locks to make sure that only one instance can enter the critical 9062306a36Sopenharmony_ciregion at any time. There are many friendly primitives in the Linux 9162306a36Sopenharmony_cikernel to help you do this. And then there are the unfriendly 9262306a36Sopenharmony_ciprimitives, but I'll pretend they don't exist. 9362306a36Sopenharmony_ci 9462306a36Sopenharmony_ciLocking in the Linux Kernel 9562306a36Sopenharmony_ci=========================== 9662306a36Sopenharmony_ci 9762306a36Sopenharmony_ciIf I could give you one piece of advice on locking: **keep it simple**. 9862306a36Sopenharmony_ci 9962306a36Sopenharmony_ciBe reluctant to introduce new locks. 10062306a36Sopenharmony_ci 10162306a36Sopenharmony_ciTwo Main Types of Kernel Locks: Spinlocks and Mutexes 10262306a36Sopenharmony_ci----------------------------------------------------- 10362306a36Sopenharmony_ci 10462306a36Sopenharmony_ciThere are two main types of kernel locks. The fundamental type is the 10562306a36Sopenharmony_cispinlock (``include/asm/spinlock.h``), which is a very simple 10662306a36Sopenharmony_cisingle-holder lock: if you can't get the spinlock, you keep trying 10762306a36Sopenharmony_ci(spinning) until you can. Spinlocks are very small and fast, and can be 10862306a36Sopenharmony_ciused anywhere. 10962306a36Sopenharmony_ci 11062306a36Sopenharmony_ciThe second type is a mutex (``include/linux/mutex.h``): it is like a 11162306a36Sopenharmony_cispinlock, but you may block holding a mutex. If you can't lock a mutex, 11262306a36Sopenharmony_ciyour task will suspend itself, and be woken up when the mutex is 11362306a36Sopenharmony_cireleased. This means the CPU can do something else while you are 11462306a36Sopenharmony_ciwaiting. There are many cases when you simply can't sleep (see 11562306a36Sopenharmony_ci`What Functions Are Safe To Call From Interrupts?`_), 11662306a36Sopenharmony_ciand so have to use a spinlock instead. 11762306a36Sopenharmony_ci 11862306a36Sopenharmony_ciNeither type of lock is recursive: see 11962306a36Sopenharmony_ci`Deadlock: Simple and Advanced`_. 12062306a36Sopenharmony_ci 12162306a36Sopenharmony_ciLocks and Uniprocessor Kernels 12262306a36Sopenharmony_ci------------------------------ 12362306a36Sopenharmony_ci 12462306a36Sopenharmony_ciFor kernels compiled without ``CONFIG_SMP``, and without 12562306a36Sopenharmony_ci``CONFIG_PREEMPT`` spinlocks do not exist at all. This is an excellent 12662306a36Sopenharmony_cidesign decision: when no-one else can run at the same time, there is no 12762306a36Sopenharmony_cireason to have a lock. 12862306a36Sopenharmony_ci 12962306a36Sopenharmony_ciIf the kernel is compiled without ``CONFIG_SMP``, but ``CONFIG_PREEMPT`` 13062306a36Sopenharmony_ciis set, then spinlocks simply disable preemption, which is sufficient to 13162306a36Sopenharmony_ciprevent any races. For most purposes, we can think of preemption as 13262306a36Sopenharmony_ciequivalent to SMP, and not worry about it separately. 13362306a36Sopenharmony_ci 13462306a36Sopenharmony_ciYou should always test your locking code with ``CONFIG_SMP`` and 13562306a36Sopenharmony_ci``CONFIG_PREEMPT`` enabled, even if you don't have an SMP test box, 13662306a36Sopenharmony_cibecause it will still catch some kinds of locking bugs. 13762306a36Sopenharmony_ci 13862306a36Sopenharmony_ciMutexes still exist, because they are required for synchronization 13962306a36Sopenharmony_cibetween user contexts, as we will see below. 14062306a36Sopenharmony_ci 14162306a36Sopenharmony_ciLocking Only In User Context 14262306a36Sopenharmony_ci---------------------------- 14362306a36Sopenharmony_ci 14462306a36Sopenharmony_ciIf you have a data structure which is only ever accessed from user 14562306a36Sopenharmony_cicontext, then you can use a simple mutex (``include/linux/mutex.h``) to 14662306a36Sopenharmony_ciprotect it. This is the most trivial case: you initialize the mutex. 14762306a36Sopenharmony_ciThen you can call mutex_lock_interruptible() to grab the 14862306a36Sopenharmony_cimutex, and mutex_unlock() to release it. There is also a 14962306a36Sopenharmony_cimutex_lock(), which should be avoided, because it will 15062306a36Sopenharmony_cinot return if a signal is received. 15162306a36Sopenharmony_ci 15262306a36Sopenharmony_ciExample: ``net/netfilter/nf_sockopt.c`` allows registration of new 15362306a36Sopenharmony_cisetsockopt() and getsockopt() calls, with 15462306a36Sopenharmony_cinf_register_sockopt(). Registration and de-registration 15562306a36Sopenharmony_ciare only done on module load and unload (and boot time, where there is 15662306a36Sopenharmony_cino concurrency), and the list of registrations is only consulted for an 15762306a36Sopenharmony_ciunknown setsockopt() or getsockopt() system 15862306a36Sopenharmony_cicall. The ``nf_sockopt_mutex`` is perfect to protect this, especially 15962306a36Sopenharmony_cisince the setsockopt and getsockopt calls may well sleep. 16062306a36Sopenharmony_ci 16162306a36Sopenharmony_ciLocking Between User Context and Softirqs 16262306a36Sopenharmony_ci----------------------------------------- 16362306a36Sopenharmony_ci 16462306a36Sopenharmony_ciIf a softirq shares data with user context, you have two problems. 16562306a36Sopenharmony_ciFirstly, the current user context can be interrupted by a softirq, and 16662306a36Sopenharmony_cisecondly, the critical region could be entered from another CPU. This is 16762306a36Sopenharmony_ciwhere spin_lock_bh() (``include/linux/spinlock.h``) is 16862306a36Sopenharmony_ciused. It disables softirqs on that CPU, then grabs the lock. 16962306a36Sopenharmony_cispin_unlock_bh() does the reverse. (The '_bh' suffix is 17062306a36Sopenharmony_cia historical reference to "Bottom Halves", the old name for software 17162306a36Sopenharmony_ciinterrupts. It should really be called spin_lock_softirq()' in a 17262306a36Sopenharmony_ciperfect world). 17362306a36Sopenharmony_ci 17462306a36Sopenharmony_ciNote that you can also use spin_lock_irq() or 17562306a36Sopenharmony_cispin_lock_irqsave() here, which stop hardware interrupts 17662306a36Sopenharmony_cias well: see `Hard IRQ Context`_. 17762306a36Sopenharmony_ci 17862306a36Sopenharmony_ciThis works perfectly for UP as well: the spin lock vanishes, and this 17962306a36Sopenharmony_cimacro simply becomes local_bh_disable() 18062306a36Sopenharmony_ci(``include/linux/interrupt.h``), which protects you from the softirq 18162306a36Sopenharmony_cibeing run. 18262306a36Sopenharmony_ci 18362306a36Sopenharmony_ciLocking Between User Context and Tasklets 18462306a36Sopenharmony_ci----------------------------------------- 18562306a36Sopenharmony_ci 18662306a36Sopenharmony_ciThis is exactly the same as above, because tasklets are actually run 18762306a36Sopenharmony_cifrom a softirq. 18862306a36Sopenharmony_ci 18962306a36Sopenharmony_ciLocking Between User Context and Timers 19062306a36Sopenharmony_ci--------------------------------------- 19162306a36Sopenharmony_ci 19262306a36Sopenharmony_ciThis, too, is exactly the same as above, because timers are actually run 19362306a36Sopenharmony_cifrom a softirq. From a locking point of view, tasklets and timers are 19462306a36Sopenharmony_ciidentical. 19562306a36Sopenharmony_ci 19662306a36Sopenharmony_ciLocking Between Tasklets/Timers 19762306a36Sopenharmony_ci------------------------------- 19862306a36Sopenharmony_ci 19962306a36Sopenharmony_ciSometimes a tasklet or timer might want to share data with another 20062306a36Sopenharmony_citasklet or timer. 20162306a36Sopenharmony_ci 20262306a36Sopenharmony_ciThe Same Tasklet/Timer 20362306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~ 20462306a36Sopenharmony_ci 20562306a36Sopenharmony_ciSince a tasklet is never run on two CPUs at once, you don't need to 20662306a36Sopenharmony_ciworry about your tasklet being reentrant (running twice at once), even 20762306a36Sopenharmony_cion SMP. 20862306a36Sopenharmony_ci 20962306a36Sopenharmony_ciDifferent Tasklets/Timers 21062306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~ 21162306a36Sopenharmony_ci 21262306a36Sopenharmony_ciIf another tasklet/timer wants to share data with your tasklet or timer 21362306a36Sopenharmony_ci, you will both need to use spin_lock() and 21462306a36Sopenharmony_cispin_unlock() calls. spin_lock_bh() is 21562306a36Sopenharmony_ciunnecessary here, as you are already in a tasklet, and none will be run 21662306a36Sopenharmony_cion the same CPU. 21762306a36Sopenharmony_ci 21862306a36Sopenharmony_ciLocking Between Softirqs 21962306a36Sopenharmony_ci------------------------ 22062306a36Sopenharmony_ci 22162306a36Sopenharmony_ciOften a softirq might want to share data with itself or a tasklet/timer. 22262306a36Sopenharmony_ci 22362306a36Sopenharmony_ciThe Same Softirq 22462306a36Sopenharmony_ci~~~~~~~~~~~~~~~~ 22562306a36Sopenharmony_ci 22662306a36Sopenharmony_ciThe same softirq can run on the other CPUs: you can use a per-CPU array 22762306a36Sopenharmony_ci(see `Per-CPU Data`_) for better performance. If you're 22862306a36Sopenharmony_cigoing so far as to use a softirq, you probably care about scalable 22962306a36Sopenharmony_ciperformance enough to justify the extra complexity. 23062306a36Sopenharmony_ci 23162306a36Sopenharmony_ciYou'll need to use spin_lock() and 23262306a36Sopenharmony_cispin_unlock() for shared data. 23362306a36Sopenharmony_ci 23462306a36Sopenharmony_ciDifferent Softirqs 23562306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~ 23662306a36Sopenharmony_ci 23762306a36Sopenharmony_ciYou'll need to use spin_lock() and 23862306a36Sopenharmony_cispin_unlock() for shared data, whether it be a timer, 23962306a36Sopenharmony_citasklet, different softirq or the same or another softirq: any of them 24062306a36Sopenharmony_cicould be running on a different CPU. 24162306a36Sopenharmony_ci 24262306a36Sopenharmony_ciHard IRQ Context 24362306a36Sopenharmony_ci================ 24462306a36Sopenharmony_ci 24562306a36Sopenharmony_ciHardware interrupts usually communicate with a tasklet or softirq. 24662306a36Sopenharmony_ciFrequently this involves putting work in a queue, which the softirq will 24762306a36Sopenharmony_citake out. 24862306a36Sopenharmony_ci 24962306a36Sopenharmony_ciLocking Between Hard IRQ and Softirqs/Tasklets 25062306a36Sopenharmony_ci---------------------------------------------- 25162306a36Sopenharmony_ci 25262306a36Sopenharmony_ciIf a hardware irq handler shares data with a softirq, you have two 25362306a36Sopenharmony_ciconcerns. Firstly, the softirq processing can be interrupted by a 25462306a36Sopenharmony_cihardware interrupt, and secondly, the critical region could be entered 25562306a36Sopenharmony_ciby a hardware interrupt on another CPU. This is where 25662306a36Sopenharmony_cispin_lock_irq() is used. It is defined to disable 25762306a36Sopenharmony_ciinterrupts on that cpu, then grab the lock. 25862306a36Sopenharmony_cispin_unlock_irq() does the reverse. 25962306a36Sopenharmony_ci 26062306a36Sopenharmony_ciThe irq handler does not need to use spin_lock_irq(), because 26162306a36Sopenharmony_cithe softirq cannot run while the irq handler is running: it can use 26262306a36Sopenharmony_cispin_lock(), which is slightly faster. The only exception 26362306a36Sopenharmony_ciwould be if a different hardware irq handler uses the same lock: 26462306a36Sopenharmony_cispin_lock_irq() will stop that from interrupting us. 26562306a36Sopenharmony_ci 26662306a36Sopenharmony_ciThis works perfectly for UP as well: the spin lock vanishes, and this 26762306a36Sopenharmony_cimacro simply becomes local_irq_disable() 26862306a36Sopenharmony_ci(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH 26962306a36Sopenharmony_cibeing run. 27062306a36Sopenharmony_ci 27162306a36Sopenharmony_cispin_lock_irqsave() (``include/linux/spinlock.h``) is a 27262306a36Sopenharmony_civariant which saves whether interrupts were on or off in a flags word, 27362306a36Sopenharmony_ciwhich is passed to spin_unlock_irqrestore(). This means 27462306a36Sopenharmony_cithat the same code can be used inside an hard irq handler (where 27562306a36Sopenharmony_ciinterrupts are already off) and in softirqs (where the irq disabling is 27662306a36Sopenharmony_cirequired). 27762306a36Sopenharmony_ci 27862306a36Sopenharmony_ciNote that softirqs (and hence tasklets and timers) are run on return 27962306a36Sopenharmony_cifrom hardware interrupts, so spin_lock_irq() also stops 28062306a36Sopenharmony_cithese. In that sense, spin_lock_irqsave() is the most 28162306a36Sopenharmony_cigeneral and powerful locking function. 28262306a36Sopenharmony_ci 28362306a36Sopenharmony_ciLocking Between Two Hard IRQ Handlers 28462306a36Sopenharmony_ci------------------------------------- 28562306a36Sopenharmony_ci 28662306a36Sopenharmony_ciIt is rare to have to share data between two IRQ handlers, but if you 28762306a36Sopenharmony_cido, spin_lock_irqsave() should be used: it is 28862306a36Sopenharmony_ciarchitecture-specific whether all interrupts are disabled inside irq 28962306a36Sopenharmony_cihandlers themselves. 29062306a36Sopenharmony_ci 29162306a36Sopenharmony_ciCheat Sheet For Locking 29262306a36Sopenharmony_ci======================= 29362306a36Sopenharmony_ci 29462306a36Sopenharmony_ciPete Zaitcev gives the following summary: 29562306a36Sopenharmony_ci 29662306a36Sopenharmony_ci- If you are in a process context (any syscall) and want to lock other 29762306a36Sopenharmony_ci process out, use a mutex. You can take a mutex and sleep 29862306a36Sopenharmony_ci (``copy_from_user()`` or ``kmalloc(x,GFP_KERNEL)``). 29962306a36Sopenharmony_ci 30062306a36Sopenharmony_ci- Otherwise (== data can be touched in an interrupt), use 30162306a36Sopenharmony_ci spin_lock_irqsave() and 30262306a36Sopenharmony_ci spin_unlock_irqrestore(). 30362306a36Sopenharmony_ci 30462306a36Sopenharmony_ci- Avoid holding spinlock for more than 5 lines of code and across any 30562306a36Sopenharmony_ci function call (except accessors like readb()). 30662306a36Sopenharmony_ci 30762306a36Sopenharmony_ciTable of Minimum Requirements 30862306a36Sopenharmony_ci----------------------------- 30962306a36Sopenharmony_ci 31062306a36Sopenharmony_ciThe following table lists the **minimum** locking requirements between 31162306a36Sopenharmony_civarious contexts. In some cases, the same context can only be running on 31262306a36Sopenharmony_cione CPU at a time, so no locking is required for that context (eg. a 31362306a36Sopenharmony_ciparticular thread can only run on one CPU at a time, but if it needs 31462306a36Sopenharmony_cishares data with another thread, locking is required). 31562306a36Sopenharmony_ci 31662306a36Sopenharmony_ciRemember the advice above: you can always use 31762306a36Sopenharmony_cispin_lock_irqsave(), which is a superset of all other 31862306a36Sopenharmony_cispinlock primitives. 31962306a36Sopenharmony_ci 32062306a36Sopenharmony_ci============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== 32162306a36Sopenharmony_ci. IRQ Handler A IRQ Handler B Softirq A Softirq B Tasklet A Tasklet B Timer A Timer B User Context A User Context B 32262306a36Sopenharmony_ci============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== 32362306a36Sopenharmony_ciIRQ Handler A None 32462306a36Sopenharmony_ciIRQ Handler B SLIS None 32562306a36Sopenharmony_ciSoftirq A SLI SLI SL 32662306a36Sopenharmony_ciSoftirq B SLI SLI SL SL 32762306a36Sopenharmony_ciTasklet A SLI SLI SL SL None 32862306a36Sopenharmony_ciTasklet B SLI SLI SL SL SL None 32962306a36Sopenharmony_ciTimer A SLI SLI SL SL SL SL None 33062306a36Sopenharmony_ciTimer B SLI SLI SL SL SL SL SL None 33162306a36Sopenharmony_ciUser Context A SLI SLI SLBH SLBH SLBH SLBH SLBH SLBH None 33262306a36Sopenharmony_ciUser Context B SLI SLI SLBH SLBH SLBH SLBH SLBH SLBH MLI None 33362306a36Sopenharmony_ci============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== 33462306a36Sopenharmony_ci 33562306a36Sopenharmony_ciTable: Table of Locking Requirements 33662306a36Sopenharmony_ci 33762306a36Sopenharmony_ci+--------+----------------------------+ 33862306a36Sopenharmony_ci| SLIS | spin_lock_irqsave | 33962306a36Sopenharmony_ci+--------+----------------------------+ 34062306a36Sopenharmony_ci| SLI | spin_lock_irq | 34162306a36Sopenharmony_ci+--------+----------------------------+ 34262306a36Sopenharmony_ci| SL | spin_lock | 34362306a36Sopenharmony_ci+--------+----------------------------+ 34462306a36Sopenharmony_ci| SLBH | spin_lock_bh | 34562306a36Sopenharmony_ci+--------+----------------------------+ 34662306a36Sopenharmony_ci| MLI | mutex_lock_interruptible | 34762306a36Sopenharmony_ci+--------+----------------------------+ 34862306a36Sopenharmony_ci 34962306a36Sopenharmony_ciTable: Legend for Locking Requirements Table 35062306a36Sopenharmony_ci 35162306a36Sopenharmony_ciThe trylock Functions 35262306a36Sopenharmony_ci===================== 35362306a36Sopenharmony_ci 35462306a36Sopenharmony_ciThere are functions that try to acquire a lock only once and immediately 35562306a36Sopenharmony_cireturn a value telling about success or failure to acquire the lock. 35662306a36Sopenharmony_ciThey can be used if you need no access to the data protected with the 35762306a36Sopenharmony_cilock when some other thread is holding the lock. You should acquire the 35862306a36Sopenharmony_cilock later if you then need access to the data protected with the lock. 35962306a36Sopenharmony_ci 36062306a36Sopenharmony_cispin_trylock() does not spin but returns non-zero if it 36162306a36Sopenharmony_ciacquires the spinlock on the first try or 0 if not. This function can be 36262306a36Sopenharmony_ciused in all contexts like spin_lock(): you must have 36362306a36Sopenharmony_cidisabled the contexts that might interrupt you and acquire the spin 36462306a36Sopenharmony_cilock. 36562306a36Sopenharmony_ci 36662306a36Sopenharmony_cimutex_trylock() does not suspend your task but returns 36762306a36Sopenharmony_cinon-zero if it could lock the mutex on the first try or 0 if not. This 36862306a36Sopenharmony_cifunction cannot be safely used in hardware or software interrupt 36962306a36Sopenharmony_cicontexts despite not sleeping. 37062306a36Sopenharmony_ci 37162306a36Sopenharmony_ciCommon Examples 37262306a36Sopenharmony_ci=============== 37362306a36Sopenharmony_ci 37462306a36Sopenharmony_ciLet's step through a simple example: a cache of number to name mappings. 37562306a36Sopenharmony_ciThe cache keeps a count of how often each of the objects is used, and 37662306a36Sopenharmony_ciwhen it gets full, throws out the least used one. 37762306a36Sopenharmony_ci 37862306a36Sopenharmony_ciAll In User Context 37962306a36Sopenharmony_ci------------------- 38062306a36Sopenharmony_ci 38162306a36Sopenharmony_ciFor our first example, we assume that all operations are in user context 38262306a36Sopenharmony_ci(ie. from system calls), so we can sleep. This means we can use a mutex 38362306a36Sopenharmony_cito protect the cache and all the objects within it. Here's the code:: 38462306a36Sopenharmony_ci 38562306a36Sopenharmony_ci #include <linux/list.h> 38662306a36Sopenharmony_ci #include <linux/slab.h> 38762306a36Sopenharmony_ci #include <linux/string.h> 38862306a36Sopenharmony_ci #include <linux/mutex.h> 38962306a36Sopenharmony_ci #include <asm/errno.h> 39062306a36Sopenharmony_ci 39162306a36Sopenharmony_ci struct object 39262306a36Sopenharmony_ci { 39362306a36Sopenharmony_ci struct list_head list; 39462306a36Sopenharmony_ci int id; 39562306a36Sopenharmony_ci char name[32]; 39662306a36Sopenharmony_ci int popularity; 39762306a36Sopenharmony_ci }; 39862306a36Sopenharmony_ci 39962306a36Sopenharmony_ci /* Protects the cache, cache_num, and the objects within it */ 40062306a36Sopenharmony_ci static DEFINE_MUTEX(cache_lock); 40162306a36Sopenharmony_ci static LIST_HEAD(cache); 40262306a36Sopenharmony_ci static unsigned int cache_num = 0; 40362306a36Sopenharmony_ci #define MAX_CACHE_SIZE 10 40462306a36Sopenharmony_ci 40562306a36Sopenharmony_ci /* Must be holding cache_lock */ 40662306a36Sopenharmony_ci static struct object *__cache_find(int id) 40762306a36Sopenharmony_ci { 40862306a36Sopenharmony_ci struct object *i; 40962306a36Sopenharmony_ci 41062306a36Sopenharmony_ci list_for_each_entry(i, &cache, list) 41162306a36Sopenharmony_ci if (i->id == id) { 41262306a36Sopenharmony_ci i->popularity++; 41362306a36Sopenharmony_ci return i; 41462306a36Sopenharmony_ci } 41562306a36Sopenharmony_ci return NULL; 41662306a36Sopenharmony_ci } 41762306a36Sopenharmony_ci 41862306a36Sopenharmony_ci /* Must be holding cache_lock */ 41962306a36Sopenharmony_ci static void __cache_delete(struct object *obj) 42062306a36Sopenharmony_ci { 42162306a36Sopenharmony_ci BUG_ON(!obj); 42262306a36Sopenharmony_ci list_del(&obj->list); 42362306a36Sopenharmony_ci kfree(obj); 42462306a36Sopenharmony_ci cache_num--; 42562306a36Sopenharmony_ci } 42662306a36Sopenharmony_ci 42762306a36Sopenharmony_ci /* Must be holding cache_lock */ 42862306a36Sopenharmony_ci static void __cache_add(struct object *obj) 42962306a36Sopenharmony_ci { 43062306a36Sopenharmony_ci list_add(&obj->list, &cache); 43162306a36Sopenharmony_ci if (++cache_num > MAX_CACHE_SIZE) { 43262306a36Sopenharmony_ci struct object *i, *outcast = NULL; 43362306a36Sopenharmony_ci list_for_each_entry(i, &cache, list) { 43462306a36Sopenharmony_ci if (!outcast || i->popularity < outcast->popularity) 43562306a36Sopenharmony_ci outcast = i; 43662306a36Sopenharmony_ci } 43762306a36Sopenharmony_ci __cache_delete(outcast); 43862306a36Sopenharmony_ci } 43962306a36Sopenharmony_ci } 44062306a36Sopenharmony_ci 44162306a36Sopenharmony_ci int cache_add(int id, const char *name) 44262306a36Sopenharmony_ci { 44362306a36Sopenharmony_ci struct object *obj; 44462306a36Sopenharmony_ci 44562306a36Sopenharmony_ci if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) 44662306a36Sopenharmony_ci return -ENOMEM; 44762306a36Sopenharmony_ci 44862306a36Sopenharmony_ci strscpy(obj->name, name, sizeof(obj->name)); 44962306a36Sopenharmony_ci obj->id = id; 45062306a36Sopenharmony_ci obj->popularity = 0; 45162306a36Sopenharmony_ci 45262306a36Sopenharmony_ci mutex_lock(&cache_lock); 45362306a36Sopenharmony_ci __cache_add(obj); 45462306a36Sopenharmony_ci mutex_unlock(&cache_lock); 45562306a36Sopenharmony_ci return 0; 45662306a36Sopenharmony_ci } 45762306a36Sopenharmony_ci 45862306a36Sopenharmony_ci void cache_delete(int id) 45962306a36Sopenharmony_ci { 46062306a36Sopenharmony_ci mutex_lock(&cache_lock); 46162306a36Sopenharmony_ci __cache_delete(__cache_find(id)); 46262306a36Sopenharmony_ci mutex_unlock(&cache_lock); 46362306a36Sopenharmony_ci } 46462306a36Sopenharmony_ci 46562306a36Sopenharmony_ci int cache_find(int id, char *name) 46662306a36Sopenharmony_ci { 46762306a36Sopenharmony_ci struct object *obj; 46862306a36Sopenharmony_ci int ret = -ENOENT; 46962306a36Sopenharmony_ci 47062306a36Sopenharmony_ci mutex_lock(&cache_lock); 47162306a36Sopenharmony_ci obj = __cache_find(id); 47262306a36Sopenharmony_ci if (obj) { 47362306a36Sopenharmony_ci ret = 0; 47462306a36Sopenharmony_ci strcpy(name, obj->name); 47562306a36Sopenharmony_ci } 47662306a36Sopenharmony_ci mutex_unlock(&cache_lock); 47762306a36Sopenharmony_ci return ret; 47862306a36Sopenharmony_ci } 47962306a36Sopenharmony_ci 48062306a36Sopenharmony_ciNote that we always make sure we have the cache_lock when we add, 48162306a36Sopenharmony_cidelete, or look up the cache: both the cache infrastructure itself and 48262306a36Sopenharmony_cithe contents of the objects are protected by the lock. In this case it's 48362306a36Sopenharmony_cieasy, since we copy the data for the user, and never let them access the 48462306a36Sopenharmony_ciobjects directly. 48562306a36Sopenharmony_ci 48662306a36Sopenharmony_ciThere is a slight (and common) optimization here: in 48762306a36Sopenharmony_cicache_add() we set up the fields of the object before 48862306a36Sopenharmony_cigrabbing the lock. This is safe, as no-one else can access it until we 48962306a36Sopenharmony_ciput it in cache. 49062306a36Sopenharmony_ci 49162306a36Sopenharmony_ciAccessing From Interrupt Context 49262306a36Sopenharmony_ci-------------------------------- 49362306a36Sopenharmony_ci 49462306a36Sopenharmony_ciNow consider the case where cache_find() can be called 49562306a36Sopenharmony_cifrom interrupt context: either a hardware interrupt or a softirq. An 49662306a36Sopenharmony_ciexample would be a timer which deletes object from the cache. 49762306a36Sopenharmony_ci 49862306a36Sopenharmony_ciThe change is shown below, in standard patch format: the ``-`` are lines 49962306a36Sopenharmony_ciwhich are taken away, and the ``+`` are lines which are added. 50062306a36Sopenharmony_ci 50162306a36Sopenharmony_ci:: 50262306a36Sopenharmony_ci 50362306a36Sopenharmony_ci --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 50462306a36Sopenharmony_ci +++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 50562306a36Sopenharmony_ci @@ -12,7 +12,7 @@ 50662306a36Sopenharmony_ci int popularity; 50762306a36Sopenharmony_ci }; 50862306a36Sopenharmony_ci 50962306a36Sopenharmony_ci -static DEFINE_MUTEX(cache_lock); 51062306a36Sopenharmony_ci +static DEFINE_SPINLOCK(cache_lock); 51162306a36Sopenharmony_ci static LIST_HEAD(cache); 51262306a36Sopenharmony_ci static unsigned int cache_num = 0; 51362306a36Sopenharmony_ci #define MAX_CACHE_SIZE 10 51462306a36Sopenharmony_ci @@ -55,6 +55,7 @@ 51562306a36Sopenharmony_ci int cache_add(int id, const char *name) 51662306a36Sopenharmony_ci { 51762306a36Sopenharmony_ci struct object *obj; 51862306a36Sopenharmony_ci + unsigned long flags; 51962306a36Sopenharmony_ci 52062306a36Sopenharmony_ci if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) 52162306a36Sopenharmony_ci return -ENOMEM; 52262306a36Sopenharmony_ci @@ -63,30 +64,33 @@ 52362306a36Sopenharmony_ci obj->id = id; 52462306a36Sopenharmony_ci obj->popularity = 0; 52562306a36Sopenharmony_ci 52662306a36Sopenharmony_ci - mutex_lock(&cache_lock); 52762306a36Sopenharmony_ci + spin_lock_irqsave(&cache_lock, flags); 52862306a36Sopenharmony_ci __cache_add(obj); 52962306a36Sopenharmony_ci - mutex_unlock(&cache_lock); 53062306a36Sopenharmony_ci + spin_unlock_irqrestore(&cache_lock, flags); 53162306a36Sopenharmony_ci return 0; 53262306a36Sopenharmony_ci } 53362306a36Sopenharmony_ci 53462306a36Sopenharmony_ci void cache_delete(int id) 53562306a36Sopenharmony_ci { 53662306a36Sopenharmony_ci - mutex_lock(&cache_lock); 53762306a36Sopenharmony_ci + unsigned long flags; 53862306a36Sopenharmony_ci + 53962306a36Sopenharmony_ci + spin_lock_irqsave(&cache_lock, flags); 54062306a36Sopenharmony_ci __cache_delete(__cache_find(id)); 54162306a36Sopenharmony_ci - mutex_unlock(&cache_lock); 54262306a36Sopenharmony_ci + spin_unlock_irqrestore(&cache_lock, flags); 54362306a36Sopenharmony_ci } 54462306a36Sopenharmony_ci 54562306a36Sopenharmony_ci int cache_find(int id, char *name) 54662306a36Sopenharmony_ci { 54762306a36Sopenharmony_ci struct object *obj; 54862306a36Sopenharmony_ci int ret = -ENOENT; 54962306a36Sopenharmony_ci + unsigned long flags; 55062306a36Sopenharmony_ci 55162306a36Sopenharmony_ci - mutex_lock(&cache_lock); 55262306a36Sopenharmony_ci + spin_lock_irqsave(&cache_lock, flags); 55362306a36Sopenharmony_ci obj = __cache_find(id); 55462306a36Sopenharmony_ci if (obj) { 55562306a36Sopenharmony_ci ret = 0; 55662306a36Sopenharmony_ci strcpy(name, obj->name); 55762306a36Sopenharmony_ci } 55862306a36Sopenharmony_ci - mutex_unlock(&cache_lock); 55962306a36Sopenharmony_ci + spin_unlock_irqrestore(&cache_lock, flags); 56062306a36Sopenharmony_ci return ret; 56162306a36Sopenharmony_ci } 56262306a36Sopenharmony_ci 56362306a36Sopenharmony_ciNote that the spin_lock_irqsave() will turn off 56462306a36Sopenharmony_ciinterrupts if they are on, otherwise does nothing (if we are already in 56562306a36Sopenharmony_cian interrupt handler), hence these functions are safe to call from any 56662306a36Sopenharmony_cicontext. 56762306a36Sopenharmony_ci 56862306a36Sopenharmony_ciUnfortunately, cache_add() calls kmalloc() 56962306a36Sopenharmony_ciwith the ``GFP_KERNEL`` flag, which is only legal in user context. I 57062306a36Sopenharmony_cihave assumed that cache_add() is still only called in 57162306a36Sopenharmony_ciuser context, otherwise this should become a parameter to 57262306a36Sopenharmony_cicache_add(). 57362306a36Sopenharmony_ci 57462306a36Sopenharmony_ciExposing Objects Outside This File 57562306a36Sopenharmony_ci---------------------------------- 57662306a36Sopenharmony_ci 57762306a36Sopenharmony_ciIf our objects contained more information, it might not be sufficient to 57862306a36Sopenharmony_cicopy the information in and out: other parts of the code might want to 57962306a36Sopenharmony_cikeep pointers to these objects, for example, rather than looking up the 58062306a36Sopenharmony_ciid every time. This produces two problems. 58162306a36Sopenharmony_ci 58262306a36Sopenharmony_ciThe first problem is that we use the ``cache_lock`` to protect objects: 58362306a36Sopenharmony_ciwe'd need to make this non-static so the rest of the code can use it. 58462306a36Sopenharmony_ciThis makes locking trickier, as it is no longer all in one place. 58562306a36Sopenharmony_ci 58662306a36Sopenharmony_ciThe second problem is the lifetime problem: if another structure keeps a 58762306a36Sopenharmony_cipointer to an object, it presumably expects that pointer to remain 58862306a36Sopenharmony_civalid. Unfortunately, this is only guaranteed while you hold the lock, 58962306a36Sopenharmony_ciotherwise someone might call cache_delete() and even 59062306a36Sopenharmony_ciworse, add another object, re-using the same address. 59162306a36Sopenharmony_ci 59262306a36Sopenharmony_ciAs there is only one lock, you can't hold it forever: no-one else would 59362306a36Sopenharmony_ciget any work done. 59462306a36Sopenharmony_ci 59562306a36Sopenharmony_ciThe solution to this problem is to use a reference count: everyone who 59662306a36Sopenharmony_cihas a pointer to the object increases it when they first get the object, 59762306a36Sopenharmony_ciand drops the reference count when they're finished with it. Whoever 59862306a36Sopenharmony_cidrops it to zero knows it is unused, and can actually delete it. 59962306a36Sopenharmony_ci 60062306a36Sopenharmony_ciHere is the code:: 60162306a36Sopenharmony_ci 60262306a36Sopenharmony_ci --- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 60362306a36Sopenharmony_ci +++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 60462306a36Sopenharmony_ci @@ -7,6 +7,7 @@ 60562306a36Sopenharmony_ci struct object 60662306a36Sopenharmony_ci { 60762306a36Sopenharmony_ci struct list_head list; 60862306a36Sopenharmony_ci + unsigned int refcnt; 60962306a36Sopenharmony_ci int id; 61062306a36Sopenharmony_ci char name[32]; 61162306a36Sopenharmony_ci int popularity; 61262306a36Sopenharmony_ci @@ -17,6 +18,35 @@ 61362306a36Sopenharmony_ci static unsigned int cache_num = 0; 61462306a36Sopenharmony_ci #define MAX_CACHE_SIZE 10 61562306a36Sopenharmony_ci 61662306a36Sopenharmony_ci +static void __object_put(struct object *obj) 61762306a36Sopenharmony_ci +{ 61862306a36Sopenharmony_ci + if (--obj->refcnt == 0) 61962306a36Sopenharmony_ci + kfree(obj); 62062306a36Sopenharmony_ci +} 62162306a36Sopenharmony_ci + 62262306a36Sopenharmony_ci +static void __object_get(struct object *obj) 62362306a36Sopenharmony_ci +{ 62462306a36Sopenharmony_ci + obj->refcnt++; 62562306a36Sopenharmony_ci +} 62662306a36Sopenharmony_ci + 62762306a36Sopenharmony_ci +void object_put(struct object *obj) 62862306a36Sopenharmony_ci +{ 62962306a36Sopenharmony_ci + unsigned long flags; 63062306a36Sopenharmony_ci + 63162306a36Sopenharmony_ci + spin_lock_irqsave(&cache_lock, flags); 63262306a36Sopenharmony_ci + __object_put(obj); 63362306a36Sopenharmony_ci + spin_unlock_irqrestore(&cache_lock, flags); 63462306a36Sopenharmony_ci +} 63562306a36Sopenharmony_ci + 63662306a36Sopenharmony_ci +void object_get(struct object *obj) 63762306a36Sopenharmony_ci +{ 63862306a36Sopenharmony_ci + unsigned long flags; 63962306a36Sopenharmony_ci + 64062306a36Sopenharmony_ci + spin_lock_irqsave(&cache_lock, flags); 64162306a36Sopenharmony_ci + __object_get(obj); 64262306a36Sopenharmony_ci + spin_unlock_irqrestore(&cache_lock, flags); 64362306a36Sopenharmony_ci +} 64462306a36Sopenharmony_ci + 64562306a36Sopenharmony_ci /* Must be holding cache_lock */ 64662306a36Sopenharmony_ci static struct object *__cache_find(int id) 64762306a36Sopenharmony_ci { 64862306a36Sopenharmony_ci @@ -35,6 +65,7 @@ 64962306a36Sopenharmony_ci { 65062306a36Sopenharmony_ci BUG_ON(!obj); 65162306a36Sopenharmony_ci list_del(&obj->list); 65262306a36Sopenharmony_ci + __object_put(obj); 65362306a36Sopenharmony_ci cache_num--; 65462306a36Sopenharmony_ci } 65562306a36Sopenharmony_ci 65662306a36Sopenharmony_ci @@ -63,6 +94,7 @@ 65762306a36Sopenharmony_ci strscpy(obj->name, name, sizeof(obj->name)); 65862306a36Sopenharmony_ci obj->id = id; 65962306a36Sopenharmony_ci obj->popularity = 0; 66062306a36Sopenharmony_ci + obj->refcnt = 1; /* The cache holds a reference */ 66162306a36Sopenharmony_ci 66262306a36Sopenharmony_ci spin_lock_irqsave(&cache_lock, flags); 66362306a36Sopenharmony_ci __cache_add(obj); 66462306a36Sopenharmony_ci @@ -79,18 +111,15 @@ 66562306a36Sopenharmony_ci spin_unlock_irqrestore(&cache_lock, flags); 66662306a36Sopenharmony_ci } 66762306a36Sopenharmony_ci 66862306a36Sopenharmony_ci -int cache_find(int id, char *name) 66962306a36Sopenharmony_ci +struct object *cache_find(int id) 67062306a36Sopenharmony_ci { 67162306a36Sopenharmony_ci struct object *obj; 67262306a36Sopenharmony_ci - int ret = -ENOENT; 67362306a36Sopenharmony_ci unsigned long flags; 67462306a36Sopenharmony_ci 67562306a36Sopenharmony_ci spin_lock_irqsave(&cache_lock, flags); 67662306a36Sopenharmony_ci obj = __cache_find(id); 67762306a36Sopenharmony_ci - if (obj) { 67862306a36Sopenharmony_ci - ret = 0; 67962306a36Sopenharmony_ci - strcpy(name, obj->name); 68062306a36Sopenharmony_ci - } 68162306a36Sopenharmony_ci + if (obj) 68262306a36Sopenharmony_ci + __object_get(obj); 68362306a36Sopenharmony_ci spin_unlock_irqrestore(&cache_lock, flags); 68462306a36Sopenharmony_ci - return ret; 68562306a36Sopenharmony_ci + return obj; 68662306a36Sopenharmony_ci } 68762306a36Sopenharmony_ci 68862306a36Sopenharmony_ciWe encapsulate the reference counting in the standard 'get' and 'put' 68962306a36Sopenharmony_cifunctions. Now we can return the object itself from 69062306a36Sopenharmony_cicache_find() which has the advantage that the user can 69162306a36Sopenharmony_cinow sleep holding the object (eg. to copy_to_user() to 69262306a36Sopenharmony_ciname to userspace). 69362306a36Sopenharmony_ci 69462306a36Sopenharmony_ciThe other point to note is that I said a reference should be held for 69562306a36Sopenharmony_cievery pointer to the object: thus the reference count is 1 when first 69662306a36Sopenharmony_ciinserted into the cache. In some versions the framework does not hold a 69762306a36Sopenharmony_cireference count, but they are more complicated. 69862306a36Sopenharmony_ci 69962306a36Sopenharmony_ciUsing Atomic Operations For The Reference Count 70062306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 70162306a36Sopenharmony_ci 70262306a36Sopenharmony_ciIn practice, :c:type:`atomic_t` would usually be used for refcnt. There are a 70362306a36Sopenharmony_cinumber of atomic operations defined in ``include/asm/atomic.h``: these 70462306a36Sopenharmony_ciare guaranteed to be seen atomically from all CPUs in the system, so no 70562306a36Sopenharmony_cilock is required. In this case, it is simpler than using spinlocks, 70662306a36Sopenharmony_cialthough for anything non-trivial using spinlocks is clearer. The 70762306a36Sopenharmony_ciatomic_inc() and atomic_dec_and_test() 70862306a36Sopenharmony_ciare used instead of the standard increment and decrement operators, and 70962306a36Sopenharmony_cithe lock is no longer used to protect the reference count itself. 71062306a36Sopenharmony_ci 71162306a36Sopenharmony_ci:: 71262306a36Sopenharmony_ci 71362306a36Sopenharmony_ci --- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 71462306a36Sopenharmony_ci +++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 71562306a36Sopenharmony_ci @@ -7,7 +7,7 @@ 71662306a36Sopenharmony_ci struct object 71762306a36Sopenharmony_ci { 71862306a36Sopenharmony_ci struct list_head list; 71962306a36Sopenharmony_ci - unsigned int refcnt; 72062306a36Sopenharmony_ci + atomic_t refcnt; 72162306a36Sopenharmony_ci int id; 72262306a36Sopenharmony_ci char name[32]; 72362306a36Sopenharmony_ci int popularity; 72462306a36Sopenharmony_ci @@ -18,33 +18,15 @@ 72562306a36Sopenharmony_ci static unsigned int cache_num = 0; 72662306a36Sopenharmony_ci #define MAX_CACHE_SIZE 10 72762306a36Sopenharmony_ci 72862306a36Sopenharmony_ci -static void __object_put(struct object *obj) 72962306a36Sopenharmony_ci -{ 73062306a36Sopenharmony_ci - if (--obj->refcnt == 0) 73162306a36Sopenharmony_ci - kfree(obj); 73262306a36Sopenharmony_ci -} 73362306a36Sopenharmony_ci - 73462306a36Sopenharmony_ci -static void __object_get(struct object *obj) 73562306a36Sopenharmony_ci -{ 73662306a36Sopenharmony_ci - obj->refcnt++; 73762306a36Sopenharmony_ci -} 73862306a36Sopenharmony_ci - 73962306a36Sopenharmony_ci void object_put(struct object *obj) 74062306a36Sopenharmony_ci { 74162306a36Sopenharmony_ci - unsigned long flags; 74262306a36Sopenharmony_ci - 74362306a36Sopenharmony_ci - spin_lock_irqsave(&cache_lock, flags); 74462306a36Sopenharmony_ci - __object_put(obj); 74562306a36Sopenharmony_ci - spin_unlock_irqrestore(&cache_lock, flags); 74662306a36Sopenharmony_ci + if (atomic_dec_and_test(&obj->refcnt)) 74762306a36Sopenharmony_ci + kfree(obj); 74862306a36Sopenharmony_ci } 74962306a36Sopenharmony_ci 75062306a36Sopenharmony_ci void object_get(struct object *obj) 75162306a36Sopenharmony_ci { 75262306a36Sopenharmony_ci - unsigned long flags; 75362306a36Sopenharmony_ci - 75462306a36Sopenharmony_ci - spin_lock_irqsave(&cache_lock, flags); 75562306a36Sopenharmony_ci - __object_get(obj); 75662306a36Sopenharmony_ci - spin_unlock_irqrestore(&cache_lock, flags); 75762306a36Sopenharmony_ci + atomic_inc(&obj->refcnt); 75862306a36Sopenharmony_ci } 75962306a36Sopenharmony_ci 76062306a36Sopenharmony_ci /* Must be holding cache_lock */ 76162306a36Sopenharmony_ci @@ -65,7 +47,7 @@ 76262306a36Sopenharmony_ci { 76362306a36Sopenharmony_ci BUG_ON(!obj); 76462306a36Sopenharmony_ci list_del(&obj->list); 76562306a36Sopenharmony_ci - __object_put(obj); 76662306a36Sopenharmony_ci + object_put(obj); 76762306a36Sopenharmony_ci cache_num--; 76862306a36Sopenharmony_ci } 76962306a36Sopenharmony_ci 77062306a36Sopenharmony_ci @@ -94,7 +76,7 @@ 77162306a36Sopenharmony_ci strscpy(obj->name, name, sizeof(obj->name)); 77262306a36Sopenharmony_ci obj->id = id; 77362306a36Sopenharmony_ci obj->popularity = 0; 77462306a36Sopenharmony_ci - obj->refcnt = 1; /* The cache holds a reference */ 77562306a36Sopenharmony_ci + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ 77662306a36Sopenharmony_ci 77762306a36Sopenharmony_ci spin_lock_irqsave(&cache_lock, flags); 77862306a36Sopenharmony_ci __cache_add(obj); 77962306a36Sopenharmony_ci @@ -119,7 +101,7 @@ 78062306a36Sopenharmony_ci spin_lock_irqsave(&cache_lock, flags); 78162306a36Sopenharmony_ci obj = __cache_find(id); 78262306a36Sopenharmony_ci if (obj) 78362306a36Sopenharmony_ci - __object_get(obj); 78462306a36Sopenharmony_ci + object_get(obj); 78562306a36Sopenharmony_ci spin_unlock_irqrestore(&cache_lock, flags); 78662306a36Sopenharmony_ci return obj; 78762306a36Sopenharmony_ci } 78862306a36Sopenharmony_ci 78962306a36Sopenharmony_ciProtecting The Objects Themselves 79062306a36Sopenharmony_ci--------------------------------- 79162306a36Sopenharmony_ci 79262306a36Sopenharmony_ciIn these examples, we assumed that the objects (except the reference 79362306a36Sopenharmony_cicounts) never changed once they are created. If we wanted to allow the 79462306a36Sopenharmony_ciname to change, there are three possibilities: 79562306a36Sopenharmony_ci 79662306a36Sopenharmony_ci- You can make ``cache_lock`` non-static, and tell people to grab that 79762306a36Sopenharmony_ci lock before changing the name in any object. 79862306a36Sopenharmony_ci 79962306a36Sopenharmony_ci- You can provide a cache_obj_rename() which grabs this 80062306a36Sopenharmony_ci lock and changes the name for the caller, and tell everyone to use 80162306a36Sopenharmony_ci that function. 80262306a36Sopenharmony_ci 80362306a36Sopenharmony_ci- You can make the ``cache_lock`` protect only the cache itself, and 80462306a36Sopenharmony_ci use another lock to protect the name. 80562306a36Sopenharmony_ci 80662306a36Sopenharmony_ciTheoretically, you can make the locks as fine-grained as one lock for 80762306a36Sopenharmony_cievery field, for every object. In practice, the most common variants 80862306a36Sopenharmony_ciare: 80962306a36Sopenharmony_ci 81062306a36Sopenharmony_ci- One lock which protects the infrastructure (the ``cache`` list in 81162306a36Sopenharmony_ci this example) and all the objects. This is what we have done so far. 81262306a36Sopenharmony_ci 81362306a36Sopenharmony_ci- One lock which protects the infrastructure (including the list 81462306a36Sopenharmony_ci pointers inside the objects), and one lock inside the object which 81562306a36Sopenharmony_ci protects the rest of that object. 81662306a36Sopenharmony_ci 81762306a36Sopenharmony_ci- Multiple locks to protect the infrastructure (eg. one lock per hash 81862306a36Sopenharmony_ci chain), possibly with a separate per-object lock. 81962306a36Sopenharmony_ci 82062306a36Sopenharmony_ciHere is the "lock-per-object" implementation: 82162306a36Sopenharmony_ci 82262306a36Sopenharmony_ci:: 82362306a36Sopenharmony_ci 82462306a36Sopenharmony_ci --- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 82562306a36Sopenharmony_ci +++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 82662306a36Sopenharmony_ci @@ -6,11 +6,17 @@ 82762306a36Sopenharmony_ci 82862306a36Sopenharmony_ci struct object 82962306a36Sopenharmony_ci { 83062306a36Sopenharmony_ci + /* These two protected by cache_lock. */ 83162306a36Sopenharmony_ci struct list_head list; 83262306a36Sopenharmony_ci + int popularity; 83362306a36Sopenharmony_ci + 83462306a36Sopenharmony_ci atomic_t refcnt; 83562306a36Sopenharmony_ci + 83662306a36Sopenharmony_ci + /* Doesn't change once created. */ 83762306a36Sopenharmony_ci int id; 83862306a36Sopenharmony_ci + 83962306a36Sopenharmony_ci + spinlock_t lock; /* Protects the name */ 84062306a36Sopenharmony_ci char name[32]; 84162306a36Sopenharmony_ci - int popularity; 84262306a36Sopenharmony_ci }; 84362306a36Sopenharmony_ci 84462306a36Sopenharmony_ci static DEFINE_SPINLOCK(cache_lock); 84562306a36Sopenharmony_ci @@ -77,6 +84,7 @@ 84662306a36Sopenharmony_ci obj->id = id; 84762306a36Sopenharmony_ci obj->popularity = 0; 84862306a36Sopenharmony_ci atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ 84962306a36Sopenharmony_ci + spin_lock_init(&obj->lock); 85062306a36Sopenharmony_ci 85162306a36Sopenharmony_ci spin_lock_irqsave(&cache_lock, flags); 85262306a36Sopenharmony_ci __cache_add(obj); 85362306a36Sopenharmony_ci 85462306a36Sopenharmony_ciNote that I decide that the popularity count should be protected by the 85562306a36Sopenharmony_ci``cache_lock`` rather than the per-object lock: this is because it (like 85662306a36Sopenharmony_cithe :c:type:`struct list_head <list_head>` inside the object) 85762306a36Sopenharmony_ciis logically part of the infrastructure. This way, I don't need to grab 85862306a36Sopenharmony_cithe lock of every object in __cache_add() when seeking 85962306a36Sopenharmony_cithe least popular. 86062306a36Sopenharmony_ci 86162306a36Sopenharmony_ciI also decided that the id member is unchangeable, so I don't need to 86262306a36Sopenharmony_cigrab each object lock in __cache_find() to examine the 86362306a36Sopenharmony_ciid: the object lock is only used by a caller who wants to read or write 86462306a36Sopenharmony_cithe name field. 86562306a36Sopenharmony_ci 86662306a36Sopenharmony_ciNote also that I added a comment describing what data was protected by 86762306a36Sopenharmony_ciwhich locks. This is extremely important, as it describes the runtime 86862306a36Sopenharmony_cibehavior of the code, and can be hard to gain from just reading. And as 86962306a36Sopenharmony_ciAlan Cox says, “Lock data, not code”. 87062306a36Sopenharmony_ci 87162306a36Sopenharmony_ciCommon Problems 87262306a36Sopenharmony_ci=============== 87362306a36Sopenharmony_ci 87462306a36Sopenharmony_ciDeadlock: Simple and Advanced 87562306a36Sopenharmony_ci----------------------------- 87662306a36Sopenharmony_ci 87762306a36Sopenharmony_ciThere is a coding bug where a piece of code tries to grab a spinlock 87862306a36Sopenharmony_citwice: it will spin forever, waiting for the lock to be released 87962306a36Sopenharmony_ci(spinlocks, rwlocks and mutexes are not recursive in Linux). This is 88062306a36Sopenharmony_citrivial to diagnose: not a 88162306a36Sopenharmony_cistay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem. 88262306a36Sopenharmony_ci 88362306a36Sopenharmony_ciFor a slightly more complex case, imagine you have a region shared by a 88462306a36Sopenharmony_cisoftirq and user context. If you use a spin_lock() call 88562306a36Sopenharmony_cito protect it, it is possible that the user context will be interrupted 88662306a36Sopenharmony_ciby the softirq while it holds the lock, and the softirq will then spin 88762306a36Sopenharmony_ciforever trying to get the same lock. 88862306a36Sopenharmony_ci 88962306a36Sopenharmony_ciBoth of these are called deadlock, and as shown above, it can occur even 89062306a36Sopenharmony_ciwith a single CPU (although not on UP compiles, since spinlocks vanish 89162306a36Sopenharmony_cion kernel compiles with ``CONFIG_SMP``\ =n. You'll still get data 89262306a36Sopenharmony_cicorruption in the second example). 89362306a36Sopenharmony_ci 89462306a36Sopenharmony_ciThis complete lockup is easy to diagnose: on SMP boxes the watchdog 89562306a36Sopenharmony_citimer or compiling with ``DEBUG_SPINLOCK`` set 89662306a36Sopenharmony_ci(``include/linux/spinlock.h``) will show this up immediately when it 89762306a36Sopenharmony_cihappens. 89862306a36Sopenharmony_ci 89962306a36Sopenharmony_ciA more complex problem is the so-called 'deadly embrace', involving two 90062306a36Sopenharmony_cior more locks. Say you have a hash table: each entry in the table is a 90162306a36Sopenharmony_cispinlock, and a chain of hashed objects. Inside a softirq handler, you 90262306a36Sopenharmony_cisometimes want to alter an object from one place in the hash to another: 90362306a36Sopenharmony_ciyou grab the spinlock of the old hash chain and the spinlock of the new 90462306a36Sopenharmony_cihash chain, and delete the object from the old one, and insert it in the 90562306a36Sopenharmony_cinew one. 90662306a36Sopenharmony_ci 90762306a36Sopenharmony_ciThere are two problems here. First, if your code ever tries to move the 90862306a36Sopenharmony_ciobject to the same chain, it will deadlock with itself as it tries to 90962306a36Sopenharmony_cilock it twice. Secondly, if the same softirq on another CPU is trying to 91062306a36Sopenharmony_cimove another object in the reverse direction, the following could 91162306a36Sopenharmony_cihappen: 91262306a36Sopenharmony_ci 91362306a36Sopenharmony_ci+-----------------------+-----------------------+ 91462306a36Sopenharmony_ci| CPU 1 | CPU 2 | 91562306a36Sopenharmony_ci+=======================+=======================+ 91662306a36Sopenharmony_ci| Grab lock A -> OK | Grab lock B -> OK | 91762306a36Sopenharmony_ci+-----------------------+-----------------------+ 91862306a36Sopenharmony_ci| Grab lock B -> spin | Grab lock A -> spin | 91962306a36Sopenharmony_ci+-----------------------+-----------------------+ 92062306a36Sopenharmony_ci 92162306a36Sopenharmony_ciTable: Consequences 92262306a36Sopenharmony_ci 92362306a36Sopenharmony_ciThe two CPUs will spin forever, waiting for the other to give up their 92462306a36Sopenharmony_cilock. It will look, smell, and feel like a crash. 92562306a36Sopenharmony_ci 92662306a36Sopenharmony_ciPreventing Deadlock 92762306a36Sopenharmony_ci------------------- 92862306a36Sopenharmony_ci 92962306a36Sopenharmony_ciTextbooks will tell you that if you always lock in the same order, you 93062306a36Sopenharmony_ciwill never get this kind of deadlock. Practice will tell you that this 93162306a36Sopenharmony_ciapproach doesn't scale: when I create a new lock, I don't understand 93262306a36Sopenharmony_cienough of the kernel to figure out where in the 5000 lock hierarchy it 93362306a36Sopenharmony_ciwill fit. 93462306a36Sopenharmony_ci 93562306a36Sopenharmony_ciThe best locks are encapsulated: they never get exposed in headers, and 93662306a36Sopenharmony_ciare never held around calls to non-trivial functions outside the same 93762306a36Sopenharmony_cifile. You can read through this code and see that it will never 93862306a36Sopenharmony_cideadlock, because it never tries to grab another lock while it has that 93962306a36Sopenharmony_cione. People using your code don't even need to know you are using a 94062306a36Sopenharmony_cilock. 94162306a36Sopenharmony_ci 94262306a36Sopenharmony_ciA classic problem here is when you provide callbacks or hooks: if you 94362306a36Sopenharmony_cicall these with the lock held, you risk simple deadlock, or a deadly 94462306a36Sopenharmony_ciembrace (who knows what the callback will do?). 94562306a36Sopenharmony_ci 94662306a36Sopenharmony_ciOverzealous Prevention Of Deadlocks 94762306a36Sopenharmony_ci~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 94862306a36Sopenharmony_ci 94962306a36Sopenharmony_ciDeadlocks are problematic, but not as bad as data corruption. Code which 95062306a36Sopenharmony_cigrabs a read lock, searches a list, fails to find what it wants, drops 95162306a36Sopenharmony_cithe read lock, grabs a write lock and inserts the object has a race 95262306a36Sopenharmony_cicondition. 95362306a36Sopenharmony_ci 95462306a36Sopenharmony_ciRacing Timers: A Kernel Pastime 95562306a36Sopenharmony_ci------------------------------- 95662306a36Sopenharmony_ci 95762306a36Sopenharmony_ciTimers can produce their own special problems with races. Consider a 95862306a36Sopenharmony_cicollection of objects (list, hash, etc) where each object has a timer 95962306a36Sopenharmony_ciwhich is due to destroy it. 96062306a36Sopenharmony_ci 96162306a36Sopenharmony_ciIf you want to destroy the entire collection (say on module removal), 96262306a36Sopenharmony_ciyou might do the following:: 96362306a36Sopenharmony_ci 96462306a36Sopenharmony_ci /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE 96562306a36Sopenharmony_ci HUNGARIAN NOTATION */ 96662306a36Sopenharmony_ci spin_lock_bh(&list_lock); 96762306a36Sopenharmony_ci 96862306a36Sopenharmony_ci while (list) { 96962306a36Sopenharmony_ci struct foo *next = list->next; 97062306a36Sopenharmony_ci timer_delete(&list->timer); 97162306a36Sopenharmony_ci kfree(list); 97262306a36Sopenharmony_ci list = next; 97362306a36Sopenharmony_ci } 97462306a36Sopenharmony_ci 97562306a36Sopenharmony_ci spin_unlock_bh(&list_lock); 97662306a36Sopenharmony_ci 97762306a36Sopenharmony_ci 97862306a36Sopenharmony_ciSooner or later, this will crash on SMP, because a timer can have just 97962306a36Sopenharmony_cigone off before the spin_lock_bh(), and it will only get 98062306a36Sopenharmony_cithe lock after we spin_unlock_bh(), and then try to free 98162306a36Sopenharmony_cithe element (which has already been freed!). 98262306a36Sopenharmony_ci 98362306a36Sopenharmony_ciThis can be avoided by checking the result of 98462306a36Sopenharmony_citimer_delete(): if it returns 1, the timer has been deleted. 98562306a36Sopenharmony_ciIf 0, it means (in this case) that it is currently running, so we can 98662306a36Sopenharmony_cido:: 98762306a36Sopenharmony_ci 98862306a36Sopenharmony_ci retry: 98962306a36Sopenharmony_ci spin_lock_bh(&list_lock); 99062306a36Sopenharmony_ci 99162306a36Sopenharmony_ci while (list) { 99262306a36Sopenharmony_ci struct foo *next = list->next; 99362306a36Sopenharmony_ci if (!timer_delete(&list->timer)) { 99462306a36Sopenharmony_ci /* Give timer a chance to delete this */ 99562306a36Sopenharmony_ci spin_unlock_bh(&list_lock); 99662306a36Sopenharmony_ci goto retry; 99762306a36Sopenharmony_ci } 99862306a36Sopenharmony_ci kfree(list); 99962306a36Sopenharmony_ci list = next; 100062306a36Sopenharmony_ci } 100162306a36Sopenharmony_ci 100262306a36Sopenharmony_ci spin_unlock_bh(&list_lock); 100362306a36Sopenharmony_ci 100462306a36Sopenharmony_ci 100562306a36Sopenharmony_ciAnother common problem is deleting timers which restart themselves (by 100662306a36Sopenharmony_cicalling add_timer() at the end of their timer function). 100762306a36Sopenharmony_ciBecause this is a fairly common case which is prone to races, you should 100862306a36Sopenharmony_ciuse timer_delete_sync() (``include/linux/timer.h``) to handle this case. 100962306a36Sopenharmony_ci 101062306a36Sopenharmony_ciBefore freeing a timer, timer_shutdown() or timer_shutdown_sync() should be 101162306a36Sopenharmony_cicalled which will keep it from being rearmed. Any subsequent attempt to 101262306a36Sopenharmony_cirearm the timer will be silently ignored by the core code. 101362306a36Sopenharmony_ci 101462306a36Sopenharmony_ci 101562306a36Sopenharmony_ciLocking Speed 101662306a36Sopenharmony_ci============= 101762306a36Sopenharmony_ci 101862306a36Sopenharmony_ciThere are three main things to worry about when considering speed of 101962306a36Sopenharmony_cisome code which does locking. First is concurrency: how many things are 102062306a36Sopenharmony_cigoing to be waiting while someone else is holding a lock. Second is the 102162306a36Sopenharmony_citime taken to actually acquire and release an uncontended lock. Third is 102262306a36Sopenharmony_ciusing fewer, or smarter locks. I'm assuming that the lock is used fairly 102362306a36Sopenharmony_cioften: otherwise, you wouldn't be concerned about efficiency. 102462306a36Sopenharmony_ci 102562306a36Sopenharmony_ciConcurrency depends on how long the lock is usually held: you should 102662306a36Sopenharmony_cihold the lock for as long as needed, but no longer. In the cache 102762306a36Sopenharmony_ciexample, we always create the object without the lock held, and then 102862306a36Sopenharmony_cigrab the lock only when we are ready to insert it in the list. 102962306a36Sopenharmony_ci 103062306a36Sopenharmony_ciAcquisition times depend on how much damage the lock operations do to 103162306a36Sopenharmony_cithe pipeline (pipeline stalls) and how likely it is that this CPU was 103262306a36Sopenharmony_cithe last one to grab the lock (ie. is the lock cache-hot for this CPU): 103362306a36Sopenharmony_cion a machine with more CPUs, this likelihood drops fast. Consider a 103462306a36Sopenharmony_ci700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomic 103562306a36Sopenharmony_ciincrement takes about 58ns, a lock which is cache-hot on this CPU takes 103662306a36Sopenharmony_ci160ns, and a cacheline transfer from another CPU takes an additional 170 103762306a36Sopenharmony_cito 360ns. (These figures from Paul McKenney's `Linux Journal RCU 103862306a36Sopenharmony_ciarticle <http://www.linuxjournal.com/article.php?sid=6993>`__). 103962306a36Sopenharmony_ci 104062306a36Sopenharmony_ciThese two aims conflict: holding a lock for a short time might be done 104162306a36Sopenharmony_ciby splitting locks into parts (such as in our final per-object-lock 104262306a36Sopenharmony_ciexample), but this increases the number of lock acquisitions, and the 104362306a36Sopenharmony_ciresults are often slower than having a single lock. This is another 104462306a36Sopenharmony_cireason to advocate locking simplicity. 104562306a36Sopenharmony_ci 104662306a36Sopenharmony_ciThe third concern is addressed below: there are some methods to reduce 104762306a36Sopenharmony_cithe amount of locking which needs to be done. 104862306a36Sopenharmony_ci 104962306a36Sopenharmony_ciRead/Write Lock Variants 105062306a36Sopenharmony_ci------------------------ 105162306a36Sopenharmony_ci 105262306a36Sopenharmony_ciBoth spinlocks and mutexes have read/write variants: ``rwlock_t`` and 105362306a36Sopenharmony_ci:c:type:`struct rw_semaphore <rw_semaphore>`. These divide 105462306a36Sopenharmony_ciusers into two classes: the readers and the writers. If you are only 105562306a36Sopenharmony_cireading the data, you can get a read lock, but to write to the data you 105662306a36Sopenharmony_cineed the write lock. Many people can hold a read lock, but a writer must 105762306a36Sopenharmony_cibe sole holder. 105862306a36Sopenharmony_ci 105962306a36Sopenharmony_ciIf your code divides neatly along reader/writer lines (as our cache code 106062306a36Sopenharmony_cidoes), and the lock is held by readers for significant lengths of time, 106162306a36Sopenharmony_ciusing these locks can help. They are slightly slower than the normal 106262306a36Sopenharmony_cilocks though, so in practice ``rwlock_t`` is not usually worthwhile. 106362306a36Sopenharmony_ci 106462306a36Sopenharmony_ciAvoiding Locks: Read Copy Update 106562306a36Sopenharmony_ci-------------------------------- 106662306a36Sopenharmony_ci 106762306a36Sopenharmony_ciThere is a special method of read/write locking called Read Copy Update. 106862306a36Sopenharmony_ciUsing RCU, the readers can avoid taking a lock altogether: as we expect 106962306a36Sopenharmony_ciour cache to be read more often than updated (otherwise the cache is a 107062306a36Sopenharmony_ciwaste of time), it is a candidate for this optimization. 107162306a36Sopenharmony_ci 107262306a36Sopenharmony_ciHow do we get rid of read locks? Getting rid of read locks means that 107362306a36Sopenharmony_ciwriters may be changing the list underneath the readers. That is 107462306a36Sopenharmony_ciactually quite simple: we can read a linked list while an element is 107562306a36Sopenharmony_cibeing added if the writer adds the element very carefully. For example, 107662306a36Sopenharmony_ciadding ``new`` to a single linked list called ``list``:: 107762306a36Sopenharmony_ci 107862306a36Sopenharmony_ci new->next = list->next; 107962306a36Sopenharmony_ci wmb(); 108062306a36Sopenharmony_ci list->next = new; 108162306a36Sopenharmony_ci 108262306a36Sopenharmony_ci 108362306a36Sopenharmony_ciThe wmb() is a write memory barrier. It ensures that the 108462306a36Sopenharmony_cifirst operation (setting the new element's ``next`` pointer) is complete 108562306a36Sopenharmony_ciand will be seen by all CPUs, before the second operation is (putting 108662306a36Sopenharmony_cithe new element into the list). This is important, since modern 108762306a36Sopenharmony_cicompilers and modern CPUs can both reorder instructions unless told 108862306a36Sopenharmony_ciotherwise: we want a reader to either not see the new element at all, or 108962306a36Sopenharmony_cisee the new element with the ``next`` pointer correctly pointing at the 109062306a36Sopenharmony_cirest of the list. 109162306a36Sopenharmony_ci 109262306a36Sopenharmony_ciFortunately, there is a function to do this for standard 109362306a36Sopenharmony_ci:c:type:`struct list_head <list_head>` lists: 109462306a36Sopenharmony_cilist_add_rcu() (``include/linux/list.h``). 109562306a36Sopenharmony_ci 109662306a36Sopenharmony_ciRemoving an element from the list is even simpler: we replace the 109762306a36Sopenharmony_cipointer to the old element with a pointer to its successor, and readers 109862306a36Sopenharmony_ciwill either see it, or skip over it. 109962306a36Sopenharmony_ci 110062306a36Sopenharmony_ci:: 110162306a36Sopenharmony_ci 110262306a36Sopenharmony_ci list->next = old->next; 110362306a36Sopenharmony_ci 110462306a36Sopenharmony_ci 110562306a36Sopenharmony_ciThere is list_del_rcu() (``include/linux/list.h``) which 110662306a36Sopenharmony_cidoes this (the normal version poisons the old object, which we don't 110762306a36Sopenharmony_ciwant). 110862306a36Sopenharmony_ci 110962306a36Sopenharmony_ciThe reader must also be careful: some CPUs can look through the ``next`` 111062306a36Sopenharmony_cipointer to start reading the contents of the next element early, but 111162306a36Sopenharmony_cidon't realize that the pre-fetched contents is wrong when the ``next`` 111262306a36Sopenharmony_cipointer changes underneath them. Once again, there is a 111362306a36Sopenharmony_cilist_for_each_entry_rcu() (``include/linux/list.h``) 111462306a36Sopenharmony_cito help you. Of course, writers can just use 111562306a36Sopenharmony_cilist_for_each_entry(), since there cannot be two 111662306a36Sopenharmony_cisimultaneous writers. 111762306a36Sopenharmony_ci 111862306a36Sopenharmony_ciOur final dilemma is this: when can we actually destroy the removed 111962306a36Sopenharmony_cielement? Remember, a reader might be stepping through this element in 112062306a36Sopenharmony_cithe list right now: if we free this element and the ``next`` pointer 112162306a36Sopenharmony_cichanges, the reader will jump off into garbage and crash. We need to 112262306a36Sopenharmony_ciwait until we know that all the readers who were traversing the list 112362306a36Sopenharmony_ciwhen we deleted the element are finished. We use 112462306a36Sopenharmony_cicall_rcu() to register a callback which will actually 112562306a36Sopenharmony_cidestroy the object once all pre-existing readers are finished. 112662306a36Sopenharmony_ciAlternatively, synchronize_rcu() may be used to block 112762306a36Sopenharmony_ciuntil all pre-existing are finished. 112862306a36Sopenharmony_ci 112962306a36Sopenharmony_ciBut how does Read Copy Update know when the readers are finished? The 113062306a36Sopenharmony_cimethod is this: firstly, the readers always traverse the list inside 113162306a36Sopenharmony_circu_read_lock()/rcu_read_unlock() pairs: 113262306a36Sopenharmony_cithese simply disable preemption so the reader won't go to sleep while 113362306a36Sopenharmony_cireading the list. 113462306a36Sopenharmony_ci 113562306a36Sopenharmony_ciRCU then waits until every other CPU has slept at least once: since 113662306a36Sopenharmony_cireaders cannot sleep, we know that any readers which were traversing the 113762306a36Sopenharmony_cilist during the deletion are finished, and the callback is triggered. 113862306a36Sopenharmony_ciThe real Read Copy Update code is a little more optimized than this, but 113962306a36Sopenharmony_cithis is the fundamental idea. 114062306a36Sopenharmony_ci 114162306a36Sopenharmony_ci:: 114262306a36Sopenharmony_ci 114362306a36Sopenharmony_ci --- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 114462306a36Sopenharmony_ci +++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 114562306a36Sopenharmony_ci @@ -1,15 +1,18 @@ 114662306a36Sopenharmony_ci #include <linux/list.h> 114762306a36Sopenharmony_ci #include <linux/slab.h> 114862306a36Sopenharmony_ci #include <linux/string.h> 114962306a36Sopenharmony_ci +#include <linux/rcupdate.h> 115062306a36Sopenharmony_ci #include <linux/mutex.h> 115162306a36Sopenharmony_ci #include <asm/errno.h> 115262306a36Sopenharmony_ci 115362306a36Sopenharmony_ci struct object 115462306a36Sopenharmony_ci { 115562306a36Sopenharmony_ci - /* These two protected by cache_lock. */ 115662306a36Sopenharmony_ci + /* This is protected by RCU */ 115762306a36Sopenharmony_ci struct list_head list; 115862306a36Sopenharmony_ci int popularity; 115962306a36Sopenharmony_ci 116062306a36Sopenharmony_ci + struct rcu_head rcu; 116162306a36Sopenharmony_ci + 116262306a36Sopenharmony_ci atomic_t refcnt; 116362306a36Sopenharmony_ci 116462306a36Sopenharmony_ci /* Doesn't change once created. */ 116562306a36Sopenharmony_ci @@ -40,7 +43,7 @@ 116662306a36Sopenharmony_ci { 116762306a36Sopenharmony_ci struct object *i; 116862306a36Sopenharmony_ci 116962306a36Sopenharmony_ci - list_for_each_entry(i, &cache, list) { 117062306a36Sopenharmony_ci + list_for_each_entry_rcu(i, &cache, list) { 117162306a36Sopenharmony_ci if (i->id == id) { 117262306a36Sopenharmony_ci i->popularity++; 117362306a36Sopenharmony_ci return i; 117462306a36Sopenharmony_ci @@ -49,19 +52,25 @@ 117562306a36Sopenharmony_ci return NULL; 117662306a36Sopenharmony_ci } 117762306a36Sopenharmony_ci 117862306a36Sopenharmony_ci +/* Final discard done once we know no readers are looking. */ 117962306a36Sopenharmony_ci +static void cache_delete_rcu(void *arg) 118062306a36Sopenharmony_ci +{ 118162306a36Sopenharmony_ci + object_put(arg); 118262306a36Sopenharmony_ci +} 118362306a36Sopenharmony_ci + 118462306a36Sopenharmony_ci /* Must be holding cache_lock */ 118562306a36Sopenharmony_ci static void __cache_delete(struct object *obj) 118662306a36Sopenharmony_ci { 118762306a36Sopenharmony_ci BUG_ON(!obj); 118862306a36Sopenharmony_ci - list_del(&obj->list); 118962306a36Sopenharmony_ci - object_put(obj); 119062306a36Sopenharmony_ci + list_del_rcu(&obj->list); 119162306a36Sopenharmony_ci cache_num--; 119262306a36Sopenharmony_ci + call_rcu(&obj->rcu, cache_delete_rcu); 119362306a36Sopenharmony_ci } 119462306a36Sopenharmony_ci 119562306a36Sopenharmony_ci /* Must be holding cache_lock */ 119662306a36Sopenharmony_ci static void __cache_add(struct object *obj) 119762306a36Sopenharmony_ci { 119862306a36Sopenharmony_ci - list_add(&obj->list, &cache); 119962306a36Sopenharmony_ci + list_add_rcu(&obj->list, &cache); 120062306a36Sopenharmony_ci if (++cache_num > MAX_CACHE_SIZE) { 120162306a36Sopenharmony_ci struct object *i, *outcast = NULL; 120262306a36Sopenharmony_ci list_for_each_entry(i, &cache, list) { 120362306a36Sopenharmony_ci @@ -104,12 +114,11 @@ 120462306a36Sopenharmony_ci struct object *cache_find(int id) 120562306a36Sopenharmony_ci { 120662306a36Sopenharmony_ci struct object *obj; 120762306a36Sopenharmony_ci - unsigned long flags; 120862306a36Sopenharmony_ci 120962306a36Sopenharmony_ci - spin_lock_irqsave(&cache_lock, flags); 121062306a36Sopenharmony_ci + rcu_read_lock(); 121162306a36Sopenharmony_ci obj = __cache_find(id); 121262306a36Sopenharmony_ci if (obj) 121362306a36Sopenharmony_ci object_get(obj); 121462306a36Sopenharmony_ci - spin_unlock_irqrestore(&cache_lock, flags); 121562306a36Sopenharmony_ci + rcu_read_unlock(); 121662306a36Sopenharmony_ci return obj; 121762306a36Sopenharmony_ci } 121862306a36Sopenharmony_ci 121962306a36Sopenharmony_ciNote that the reader will alter the popularity member in 122062306a36Sopenharmony_ci__cache_find(), and now it doesn't hold a lock. One 122162306a36Sopenharmony_cisolution would be to make it an ``atomic_t``, but for this usage, we 122262306a36Sopenharmony_cidon't really care about races: an approximate result is good enough, so 122362306a36Sopenharmony_ciI didn't change it. 122462306a36Sopenharmony_ci 122562306a36Sopenharmony_ciThe result is that cache_find() requires no 122662306a36Sopenharmony_cisynchronization with any other functions, so is almost as fast on SMP as 122762306a36Sopenharmony_ciit would be on UP. 122862306a36Sopenharmony_ci 122962306a36Sopenharmony_ciThere is a further optimization possible here: remember our original 123062306a36Sopenharmony_cicache code, where there were no reference counts and the caller simply 123162306a36Sopenharmony_ciheld the lock whenever using the object? This is still possible: if you 123262306a36Sopenharmony_cihold the lock, no one can delete the object, so you don't need to get 123362306a36Sopenharmony_ciand put the reference count. 123462306a36Sopenharmony_ci 123562306a36Sopenharmony_ciNow, because the 'read lock' in RCU is simply disabling preemption, a 123662306a36Sopenharmony_cicaller which always has preemption disabled between calling 123762306a36Sopenharmony_cicache_find() and object_put() does not 123862306a36Sopenharmony_cineed to actually get and put the reference count: we could expose 123962306a36Sopenharmony_ci__cache_find() by making it non-static, and such 124062306a36Sopenharmony_cicallers could simply call that. 124162306a36Sopenharmony_ci 124262306a36Sopenharmony_ciThe benefit here is that the reference count is not written to: the 124362306a36Sopenharmony_ciobject is not altered in any way, which is much faster on SMP machines 124462306a36Sopenharmony_cidue to caching. 124562306a36Sopenharmony_ci 124662306a36Sopenharmony_ciPer-CPU Data 124762306a36Sopenharmony_ci------------ 124862306a36Sopenharmony_ci 124962306a36Sopenharmony_ciAnother technique for avoiding locking which is used fairly widely is to 125062306a36Sopenharmony_ciduplicate information for each CPU. For example, if you wanted to keep a 125162306a36Sopenharmony_cicount of a common condition, you could use a spin lock and a single 125262306a36Sopenharmony_cicounter. Nice and simple. 125362306a36Sopenharmony_ci 125462306a36Sopenharmony_ciIf that was too slow (it's usually not, but if you've got a really big 125562306a36Sopenharmony_cimachine to test on and can show that it is), you could instead use a 125662306a36Sopenharmony_cicounter for each CPU, then none of them need an exclusive lock. See 125762306a36Sopenharmony_ciDEFINE_PER_CPU(), get_cpu_var() and 125862306a36Sopenharmony_ciput_cpu_var() (``include/linux/percpu.h``). 125962306a36Sopenharmony_ci 126062306a36Sopenharmony_ciOf particular use for simple per-cpu counters is the ``local_t`` type, 126162306a36Sopenharmony_ciand the cpu_local_inc() and related functions, which are 126262306a36Sopenharmony_cimore efficient than simple code on some architectures 126362306a36Sopenharmony_ci(``include/asm/local.h``). 126462306a36Sopenharmony_ci 126562306a36Sopenharmony_ciNote that there is no simple, reliable way of getting an exact value of 126662306a36Sopenharmony_cisuch a counter, without introducing more locks. This is not a problem 126762306a36Sopenharmony_cifor some uses. 126862306a36Sopenharmony_ci 126962306a36Sopenharmony_ciData Which Mostly Used By An IRQ Handler 127062306a36Sopenharmony_ci---------------------------------------- 127162306a36Sopenharmony_ci 127262306a36Sopenharmony_ciIf data is always accessed from within the same IRQ handler, you don't 127362306a36Sopenharmony_cineed a lock at all: the kernel already guarantees that the irq handler 127462306a36Sopenharmony_ciwill not run simultaneously on multiple CPUs. 127562306a36Sopenharmony_ci 127662306a36Sopenharmony_ciManfred Spraul points out that you can still do this, even if the data 127762306a36Sopenharmony_ciis very occasionally accessed in user context or softirqs/tasklets. The 127862306a36Sopenharmony_ciirq handler doesn't use a lock, and all other accesses are done as so:: 127962306a36Sopenharmony_ci 128062306a36Sopenharmony_ci mutex_lock(&lock); 128162306a36Sopenharmony_ci disable_irq(irq); 128262306a36Sopenharmony_ci ... 128362306a36Sopenharmony_ci enable_irq(irq); 128462306a36Sopenharmony_ci mutex_unlock(&lock); 128562306a36Sopenharmony_ci 128662306a36Sopenharmony_ciThe disable_irq() prevents the irq handler from running 128762306a36Sopenharmony_ci(and waits for it to finish if it's currently running on other CPUs). 128862306a36Sopenharmony_ciThe spinlock prevents any other accesses happening at the same time. 128962306a36Sopenharmony_ciNaturally, this is slower than just a spin_lock_irq() 129062306a36Sopenharmony_cicall, so it only makes sense if this type of access happens extremely 129162306a36Sopenharmony_cirarely. 129262306a36Sopenharmony_ci 129362306a36Sopenharmony_ciWhat Functions Are Safe To Call From Interrupts? 129462306a36Sopenharmony_ci================================================ 129562306a36Sopenharmony_ci 129662306a36Sopenharmony_ciMany functions in the kernel sleep (ie. call schedule()) directly or 129762306a36Sopenharmony_ciindirectly: you can never call them while holding a spinlock, or with 129862306a36Sopenharmony_cipreemption disabled. This also means you need to be in user context: 129962306a36Sopenharmony_cicalling them from an interrupt is illegal. 130062306a36Sopenharmony_ci 130162306a36Sopenharmony_ciSome Functions Which Sleep 130262306a36Sopenharmony_ci-------------------------- 130362306a36Sopenharmony_ci 130462306a36Sopenharmony_ciThe most common ones are listed below, but you usually have to read the 130562306a36Sopenharmony_cicode to find out if other calls are safe. If everyone else who calls it 130662306a36Sopenharmony_cican sleep, you probably need to be able to sleep, too. In particular, 130762306a36Sopenharmony_ciregistration and deregistration functions usually expect to be called 130862306a36Sopenharmony_cifrom user context, and can sleep. 130962306a36Sopenharmony_ci 131062306a36Sopenharmony_ci- Accesses to userspace: 131162306a36Sopenharmony_ci 131262306a36Sopenharmony_ci - copy_from_user() 131362306a36Sopenharmony_ci 131462306a36Sopenharmony_ci - copy_to_user() 131562306a36Sopenharmony_ci 131662306a36Sopenharmony_ci - get_user() 131762306a36Sopenharmony_ci 131862306a36Sopenharmony_ci - put_user() 131962306a36Sopenharmony_ci 132062306a36Sopenharmony_ci- kmalloc(GP_KERNEL) <kmalloc>` 132162306a36Sopenharmony_ci 132262306a36Sopenharmony_ci- mutex_lock_interruptible() and 132362306a36Sopenharmony_ci mutex_lock() 132462306a36Sopenharmony_ci 132562306a36Sopenharmony_ci There is a mutex_trylock() which does not sleep. 132662306a36Sopenharmony_ci Still, it must not be used inside interrupt context since its 132762306a36Sopenharmony_ci implementation is not safe for that. mutex_unlock() 132862306a36Sopenharmony_ci will also never sleep. It cannot be used in interrupt context either 132962306a36Sopenharmony_ci since a mutex must be released by the same task that acquired it. 133062306a36Sopenharmony_ci 133162306a36Sopenharmony_ciSome Functions Which Don't Sleep 133262306a36Sopenharmony_ci-------------------------------- 133362306a36Sopenharmony_ci 133462306a36Sopenharmony_ciSome functions are safe to call from any context, or holding almost any 133562306a36Sopenharmony_cilock. 133662306a36Sopenharmony_ci 133762306a36Sopenharmony_ci- printk() 133862306a36Sopenharmony_ci 133962306a36Sopenharmony_ci- kfree() 134062306a36Sopenharmony_ci 134162306a36Sopenharmony_ci- add_timer() and timer_delete() 134262306a36Sopenharmony_ci 134362306a36Sopenharmony_ciMutex API reference 134462306a36Sopenharmony_ci=================== 134562306a36Sopenharmony_ci 134662306a36Sopenharmony_ci.. kernel-doc:: include/linux/mutex.h 134762306a36Sopenharmony_ci :internal: 134862306a36Sopenharmony_ci 134962306a36Sopenharmony_ci.. kernel-doc:: kernel/locking/mutex.c 135062306a36Sopenharmony_ci :export: 135162306a36Sopenharmony_ci 135262306a36Sopenharmony_ciFutex API reference 135362306a36Sopenharmony_ci=================== 135462306a36Sopenharmony_ci 135562306a36Sopenharmony_ci.. kernel-doc:: kernel/futex/core.c 135662306a36Sopenharmony_ci :internal: 135762306a36Sopenharmony_ci 135862306a36Sopenharmony_ci.. kernel-doc:: kernel/futex/futex.h 135962306a36Sopenharmony_ci :internal: 136062306a36Sopenharmony_ci 136162306a36Sopenharmony_ci.. kernel-doc:: kernel/futex/pi.c 136262306a36Sopenharmony_ci :internal: 136362306a36Sopenharmony_ci 136462306a36Sopenharmony_ci.. kernel-doc:: kernel/futex/requeue.c 136562306a36Sopenharmony_ci :internal: 136662306a36Sopenharmony_ci 136762306a36Sopenharmony_ci.. kernel-doc:: kernel/futex/waitwake.c 136862306a36Sopenharmony_ci :internal: 136962306a36Sopenharmony_ci 137062306a36Sopenharmony_ciFurther reading 137162306a36Sopenharmony_ci=============== 137262306a36Sopenharmony_ci 137362306a36Sopenharmony_ci- ``Documentation/locking/spinlocks.rst``: Linus Torvalds' spinlocking 137462306a36Sopenharmony_ci tutorial in the kernel sources. 137562306a36Sopenharmony_ci 137662306a36Sopenharmony_ci- Unix Systems for Modern Architectures: Symmetric Multiprocessing and 137762306a36Sopenharmony_ci Caching for Kernel Programmers: 137862306a36Sopenharmony_ci 137962306a36Sopenharmony_ci Curt Schimmel's very good introduction to kernel level locking (not 138062306a36Sopenharmony_ci written for Linux, but nearly everything applies). The book is 138162306a36Sopenharmony_ci expensive, but really worth every penny to understand SMP locking. 138262306a36Sopenharmony_ci [ISBN: 0201633388] 138362306a36Sopenharmony_ci 138462306a36Sopenharmony_ciThanks 138562306a36Sopenharmony_ci====== 138662306a36Sopenharmony_ci 138762306a36Sopenharmony_ciThanks to Telsa Gwynne for DocBooking, neatening and adding style. 138862306a36Sopenharmony_ci 138962306a36Sopenharmony_ciThanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras, 139062306a36Sopenharmony_ciRuedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev, 139162306a36Sopenharmony_ciJames Morris, Robert Love, Paul McKenney, John Ashby for proofreading, 139262306a36Sopenharmony_cicorrecting, flaming, commenting. 139362306a36Sopenharmony_ci 139462306a36Sopenharmony_ciThanks to the cabal for having no influence on this document. 139562306a36Sopenharmony_ci 139662306a36Sopenharmony_ciGlossary 139762306a36Sopenharmony_ci======== 139862306a36Sopenharmony_ci 139962306a36Sopenharmony_cipreemption 140062306a36Sopenharmony_ci Prior to 2.5, or when ``CONFIG_PREEMPT`` is unset, processes in user 140162306a36Sopenharmony_ci context inside the kernel would not preempt each other (ie. you had that 140262306a36Sopenharmony_ci CPU until you gave it up, except for interrupts). With the addition of 140362306a36Sopenharmony_ci ``CONFIG_PREEMPT`` in 2.5.4, this changed: when in user context, higher 140462306a36Sopenharmony_ci priority tasks can "cut in": spinlocks were changed to disable 140562306a36Sopenharmony_ci preemption, even on UP. 140662306a36Sopenharmony_ci 140762306a36Sopenharmony_cibh 140862306a36Sopenharmony_ci Bottom Half: for historical reasons, functions with '_bh' in them often 140962306a36Sopenharmony_ci now refer to any software interrupt, e.g. spin_lock_bh() 141062306a36Sopenharmony_ci blocks any software interrupt on the current CPU. Bottom halves are 141162306a36Sopenharmony_ci deprecated, and will eventually be replaced by tasklets. Only one bottom 141262306a36Sopenharmony_ci half will be running at any time. 141362306a36Sopenharmony_ci 141462306a36Sopenharmony_ciHardware Interrupt / Hardware IRQ 141562306a36Sopenharmony_ci Hardware interrupt request. in_hardirq() returns true in a 141662306a36Sopenharmony_ci hardware interrupt handler. 141762306a36Sopenharmony_ci 141862306a36Sopenharmony_ciInterrupt Context 141962306a36Sopenharmony_ci Not user context: processing a hardware irq or software irq. Indicated 142062306a36Sopenharmony_ci by the in_interrupt() macro returning true. 142162306a36Sopenharmony_ci 142262306a36Sopenharmony_ciSMP 142362306a36Sopenharmony_ci Symmetric Multi-Processor: kernels compiled for multiple-CPU machines. 142462306a36Sopenharmony_ci (``CONFIG_SMP=y``). 142562306a36Sopenharmony_ci 142662306a36Sopenharmony_ciSoftware Interrupt / softirq 142762306a36Sopenharmony_ci Software interrupt handler. in_hardirq() returns false; 142862306a36Sopenharmony_ci in_softirq() returns true. Tasklets and softirqs both 142962306a36Sopenharmony_ci fall into the category of 'software interrupts'. 143062306a36Sopenharmony_ci 143162306a36Sopenharmony_ci Strictly speaking a softirq is one of up to 32 enumerated software 143262306a36Sopenharmony_ci interrupts which can run on multiple CPUs at once. Sometimes used to 143362306a36Sopenharmony_ci refer to tasklets as well (ie. all software interrupts). 143462306a36Sopenharmony_ci 143562306a36Sopenharmony_citasklet 143662306a36Sopenharmony_ci A dynamically-registrable software interrupt, which is guaranteed to 143762306a36Sopenharmony_ci only run on one CPU at a time. 143862306a36Sopenharmony_ci 143962306a36Sopenharmony_citimer 144062306a36Sopenharmony_ci A dynamically-registrable software interrupt, which is run at (or close 144162306a36Sopenharmony_ci to) a given time. When running, it is just like a tasklet (in fact, they 144262306a36Sopenharmony_ci are called from the ``TIMER_SOFTIRQ``). 144362306a36Sopenharmony_ci 144462306a36Sopenharmony_ciUP 144562306a36Sopenharmony_ci Uni-Processor: Non-SMP. (``CONFIG_SMP=n``). 144662306a36Sopenharmony_ci 144762306a36Sopenharmony_ciUser Context 144862306a36Sopenharmony_ci The kernel executing on behalf of a particular process (ie. a system 144962306a36Sopenharmony_ci call or trap) or kernel thread. You can tell which process with the 145062306a36Sopenharmony_ci ``current`` macro.) Not to be confused with userspace. Can be 145162306a36Sopenharmony_ci interrupted by software or hardware interrupts. 145262306a36Sopenharmony_ci 145362306a36Sopenharmony_ciUserspace 145462306a36Sopenharmony_ci A process executing its own code outside the kernel. 1455