.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache mechanism that maintains the data
coherence of one cache line stored in multiple CPUs' caches; the
academic definition for it is in [1]_. Consider a struct with a
refcount and a string::

        struct foo {
                refcount_t refcount;
                ...
                char name[16];
        } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line, as shown below::

                +-----------+                     +-----------+
                |   CPU 0   |                     |   CPU 1   |
                +-----------+                     +-----------+
               /                                        |
              /                                         |
             V                                          V
         +----------------------+             +----------------------+
         | A      B             | Cache 0     | A       B            | Cache 1
         +----------------------+             +----------------------+
                             |                  |
  ---------------------------+------------------+-----------------------------
                             |                  |
                           +----------------------+
                           |                      |
                           +----------------------+
              Main Memory  | A       B            |
                           +----------------------+

'refcount' is modified frequently, but 'name' is set once at
object creation time and is never modified. When many CPUs access 'foo'
at the same time, with 'refcount' being frequently bumped by only one
CPU and 'name' being read by the other CPUs, all those reading CPUs
have to reload the whole cache line over and over due to the 'sharing',
even though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside the
mm_struct struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* In the concurrent accesses to the data, there is at least one write
  operation: write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
Back when a platform had only one or a few CPUs, hot data
members could be purposely put in the same cache line to make them
cache hot and save cache lines and TLB entries, like a lock and the
data protected by it.
But for recent large systems with hundreds of CPUs, this may
not work when the lock is heavily contended, as the lock owner CPU
could write to the data while other CPUs are busy spinning on the lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* lock (spinlock/mutex/semaphore) and data protected by it are
  purposely put in one cache line.
* global data being put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* data members of a big data structure randomly sitting together
  without being noticed (cache line is usually 64 bytes or more),
  like 'mem_cgroup' struct.

The 'Possible Mitigations' section below provides real-world examples.

False sharing can easily happen unless it is intentionally checked
for, and it is valuable to run specific tools on performance
critical workloads to detect cases where false sharing affects
performance and optimize accordingly.
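
The first pattern above (a lock packed next to the data it protects)
can be checked with plain offset arithmetic. The following is a minimal
userspace sketch, not kernel code: 'struct demo_queue', 'demo_line_of()',
'demo_same_line()' and the 64-byte 'DEMO_CACHE_LINE' are hypothetical
names and assumptions made for illustration, and the object is assumed
to start on a cache line boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; the real size is platform specific. */
#define DEMO_CACHE_LINE 64

/* Hypothetical layout: a lock word packed right next to its data. */
struct demo_queue {
	int lock;	/* waiting CPUs keep polling this word */
	long head;	/* written by the current lock owner */
	long tail;	/* written by the current lock owner */
};

/* The whole struct fits in one line, so every member shares that line. */
static_assert(sizeof(struct demo_queue) <= DEMO_CACHE_LINE,
	      "demo_queue is expected to fit in a single cache line");

/*
 * Index of the cache line that a member offset falls into, assuming the
 * enclosing object itself starts on a cache line boundary.
 */
static size_t demo_line_of(size_t offset)
{
	return offset / DEMO_CACHE_LINE;
}

/* True when two member offsets land in the same cache line. */
static bool demo_same_line(size_t off_a, size_t off_b)
{
	return demo_line_of(off_a) == demo_line_of(off_b);
}
```

With this layout, 'demo_same_line(offsetof(struct demo_queue, lock),
offsetof(struct demo_queue, head))' is true, which is exactly the
situation where spinning CPUs and the lock owner contend for one line.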


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the possible false sharing
data structures. 'addr2line' is also good at decoding instruction
pointers when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (line number of file) accessing each cache line,
and the in-line offset of the data. Simple commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction for perf-c2c is [3]_.

'pahole' decodes data structure layouts delimited in cache line
granularity. Users can match the offset in perf-c2c output with
pahole's decoding to locate the exact data members. For global
data, users can search the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption. Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

Cases where false sharing hurts performance are seen more frequently
as core counts increase.
Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged. Some common mitigations
(with examples) are:

* Separate hot global data in its own dedicated cache line, even if it
  is just a 'short' type. The downside is more consumption of memory,
  cache line and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure, separate the interfering members to
  different cache lines. One downside is it may introduce new false
  sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For example, for some global variable, use compare(read)-then-write
  instead of an unconditional write. That is, use::

        if (!test_bit(XXX))
                set_bit(XXX);

  instead of directly "set_bit(XXX);", similarly for atomic_t data::

        if (atomic_read(XXX) == AAA)
                atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data to 'per-cpu data + global data' when possible,
  or reasonably increase the threshold for syncing per-cpu data to
  global data, to reduce or postpone the 'write' to that global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

All mitigations should be carefully verified to not cause side
effects. To avoid introducing false sharing when coding, it's better
to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields on
  different cache lines.

It is also better to add a comment stating the false sharing
consideration.
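
The layout guidelines and the compare(read)-then-write idea above can
be sketched in plain C11. This is a userspace illustration under stated
assumptions, not kernel code: 'struct demo_stats', 'demo_set_flag_once()'
and the 64-byte 'DEMO_CACHE_LINE' are hypothetical, and C11 'alignas'
stands in for kernel annotations such as '____cacheline_aligned_in_smp'.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; the real size is platform specific. */
#define DEMO_CACHE_LINE 64

struct demo_stats {
	/* Read-mostly fields grouped together: set once, then only read. */
	char name[16];
	int id;

	/*
	 * False sharing consideration: 'hits' is written frequently, so it
	 * starts a new cache line, away from the read-mostly fields above.
	 */
	alignas(DEMO_CACHE_LINE) atomic_long hits;
};

/* Compile-time check that the hot field really starts a new cache line. */
static_assert(offsetof(struct demo_stats, hits) % DEMO_CACHE_LINE == 0,
	      "hits must start on a cache line boundary");

/*
 * compare(read)-then-write: only store when the flag is not yet set, so
 * that repeated calls do not keep dirtying the cache line. The read and
 * the store are not one atomic step; that is acceptable here because
 * setting the flag is idempotent. Returns true if this call stored.
 */
static bool demo_set_flag_once(atomic_int *flag)
{
	if (atomic_load_explicit(flag, memory_order_relaxed))
		return false;
	atomic_store_explicit(flag, 1, memory_order_relaxed);
	return true;
}
```

The cost of the aligned member is padding: the struct grows to a
multiple of the cache line size, which is the memory trade-off noted in
the first mitigation above.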

Note that sometimes, even after a severe false sharing case is detected
and solved, performance may still show no obvious improvement, as
the hotspot switches to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes the situation of cache
line sharing of data members.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/