.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache mechanism that maintains the
coherence of one cache line stored in multiple CPUs' caches; the
academic definition for it is in [1]_. Consider a struct with a
refcount and a string::

	struct foo {
		refcount_t refcount;
		...
		char name[16];
	} ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line, as shown below::

                +-----------+                     +-----------+
                |   CPU 0   |                     |   CPU 1   |
                +-----------+                     +-----------+
               /                                        |
              /                                         |
             V                                          V
         +----------------------+             +----------------------+
         | A      B             | Cache 0     | A       B            | Cache 1
         +----------------------+             +----------------------+
                             |                  |
  ---------------------------+------------------+-----------------------------
                             |                  |
                           +----------------------+
                           |                      |
                           +----------------------+
              Main Memory  | A       B            |
                           +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified.  When many CPUs access 'foo' at
the same time, with 'refcount' frequently bumped by only one CPU and
'name' read by the other CPUs, all those reading CPUs have to reload
the whole cache line over and over due to the 'sharing', even though
'name' is never changed.
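
The sharing described above can be checked with a small userspace
sketch.  It is only an illustration: the 64-byte line size, the names
'foo_shared', 'foo_split' and 'same_cache_line()', and the plain 'int'
standing in for 'refcount_t' are all assumptions made for this example,
not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; the real size depends on the CPU. */
#define EXAMPLE_CACHE_LINE 64

/* Hypothetical stand-in for 'struct foo'; an int replaces refcount_t. */
struct foo_shared {
	int refcount;		/* written frequently */
	char name[16];		/* written once, then only read */
};

/* Same members, with 'name' moved to the start of the next cache line. */
struct foo_split {
	_Alignas(EXAMPLE_CACHE_LINE) int refcount;
	_Alignas(EXAMPLE_CACHE_LINE) char name[16];
};

/* The split layout costs memory: it always occupies two full lines. */
static_assert(sizeof(struct foo_split) == 2 * EXAMPLE_CACHE_LINE,
	      "split layout should span exactly two cache lines");

/*
 * Return 1 if two member offsets land in the same cache line.
 * This assumes the object itself starts on a cache line boundary.
 */
static int same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / EXAMPLE_CACHE_LINE == off_b / EXAMPLE_CACHE_LINE;
}
```

With these definitions, calling 'same_cache_line()' on the offsets of
'refcount' and 'name' returns 1 for 'struct foo_shared' and 0 for
'struct foo_split'.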

There are many real-world cases of performance regressions caused by
false sharing.  One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* In the concurrent accesses to the data, there is at least one write
  operation: write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
Back when a platform had only one or a few CPUs, hot data members
could be purposely put in the same cache line to keep them cache hot
and save cache lines and TLB entries, like a lock and the data
protected by it.  But on recent large systems with hundreds of CPUs,
this may not work when the lock is heavily contended, as the lock owner
CPU could write to the data while other CPUs are busy spinning on the
lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* lock (spinlock/mutex/semaphore) and data protected by it are
  purposely put in one cache line.
* global data being put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* data members of a big data structure randomly sitting together
  without being noticed (cache line is usually 64 bytes or more),
  like 'mem_cgroup' struct.

The 'Possible Mitigations' section below provides real-world examples.

False sharing can easily happen unless it is intentionally checked
for, and it is valuable to run specific tools on performance-critical
workloads to detect cases where false sharing affects performance and
optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the possible false sharing
data structures.  'addr2line' is also good at decoding an instruction
pointer when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (with file and line number) accessing each cache
line, and the in-line offset of the data. Simple commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts at cache line granularity.
Users can match the offset in perf-c2c output with pahole's decoding
to locate the exact data members.  For global data, users can search
the data address in System.map.
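
The offset matching described above is plain address arithmetic, which
the following sketch spells out.  It assumes 64-byte cache lines, and
the helper names are made up for this example.  It also assumes the
start address of the enclosing object is already known, which perf-c2c
output alone does not provide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define EXAMPLE_CACHE_LINE 64ULL

static_assert((EXAMPLE_CACHE_LINE & (EXAMPLE_CACHE_LINE - 1)) == 0,
	      "cache line size must be a power of two for the masks below");

/* Address of the first byte of the cache line containing 'addr'. */
static uint64_t cacheline_base(uint64_t addr)
{
	return addr & ~(EXAMPLE_CACHE_LINE - 1);
}

/* Byte offset of 'addr' inside its cache line. */
static uint64_t cacheline_offset(uint64_t addr)
{
	return addr & (EXAMPLE_CACHE_LINE - 1);
}

/*
 * Offset of an access inside a structure, given the cache line address
 * and in-line offset of the access, plus the (hypothetical, separately
 * obtained) start address of the structure.  The result is the number
 * to look up in a pahole dump of that structure.
 */
static uint64_t struct_offset(uint64_t struct_start, uint64_t line_base,
			      uint64_t in_line_off)
{
	return line_base + in_line_off - struct_start;
}
```

For the sample report above, an access at address 0xff1100088366d888
belongs to cache line 0xff1100088366d880 at in-line offset 0x8, which
matches the first four rows.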


Possible Mitigations
====================
False sharing does not always need to be mitigated.  False sharing
mitigations should balance performance gains with complexity and
space consumption.  Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

Cases of false sharing hurting performance are seen more frequently
as core counts increase.  Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged.  Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure to separate the interfering members
  into different cache lines.  One downside is that it may introduce
  new false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For some global variables, use compare(read)-then-write instead
  of an unconditional write. For example, use::

	if (!test_bit(XXX))
		set_bit(XXX);

  instead of directly "set_bit(XXX);", similarly for atomic_t data::

	if (atomic_read(XXX) == AAA)
		atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when possible,
  or reasonably increase the threshold for syncing per-cpu data to
  global data, to reduce or postpone the 'write' to that global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
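
The last two mitigations can be sketched in userspace C11 as follows.
This is not the kernel's percpu_counter implementation: thread-local
storage stands in for per-cpu data, and every name and the batch size
of 32 are assumptions chosen for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical threshold for folding a local delta into the total. */
#define EXAMPLE_BATCH 32

static_assert(EXAMPLE_BATCH > 0, "batch size must be positive");

/* Shared total: written only once per EXAMPLE_BATCH local updates. */
static atomic_long example_global_count;

/* Per-thread delta, a userspace stand-in for per-cpu data. */
static _Thread_local long example_local_delta;

/* Shared flag used by the read-then-write example below. */
static atomic_int example_flag;

/* Add 'n' locally; touch the shared cache line only at the threshold. */
static void example_counter_add(long n)
{
	example_local_delta += n;
	if (example_local_delta >= EXAMPLE_BATCH ||
	    example_local_delta <= -EXAMPLE_BATCH) {
		atomic_fetch_add_explicit(&example_global_count,
					  example_local_delta,
					  memory_order_relaxed);
		example_local_delta = 0;
	}
}

/* Approximate read: deltas not yet folded in are not counted. */
static long example_counter_read(void)
{
	return atomic_load_explicit(&example_global_count,
				    memory_order_relaxed);
}

/* Read first, and write only if the value actually has to change. */
static void example_set_flag_once(void)
{
	if (atomic_load_explicit(&example_flag, memory_order_relaxed) == 0)
		atomic_store_explicit(&example_flag, 1, memory_order_relaxed);
}
```

A larger batch means fewer writes to the shared line but a less
accurate global value, which is the trade-off behind raising the sync
threshold.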

Of course, all mitigations should be carefully verified to not cause
side effects.  To avoid introducing false sharing when coding, it's
better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields on
  different cache lines.

and, preferably, add a comment stating the false sharing consideration.
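
A layout following these guidelines might look like the sketch below.
'struct example_stats' and its fields are hypothetical, the 64-byte
line size is an assumption, and C11 '_Alignas' is used here where
kernel code would typically use an annotation such as
'____cacheline_aligned_in_smp'.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
#define EXAMPLE_CACHE_LINE 64

/* Hypothetical structure, not a real kernel type. */
struct example_stats {
	/* Read-mostly: set at init time, then read by many CPUs. */
	unsigned int limit;
	unsigned int flags;

	/*
	 * Write-hot: updated together on every event.  Start a new cache
	 * line here so these writes do not invalidate the read-mostly
	 * fields above (false sharing consideration).
	 */
	_Alignas(EXAMPLE_CACHE_LINE) unsigned long events;
	unsigned long bytes;
};

/* Read-mostly and write-hot groups must not share a cache line. */
static_assert(offsetof(struct example_stats, flags) / EXAMPLE_CACHE_LINE !=
	      offsetof(struct example_stats, events) / EXAMPLE_CACHE_LINE,
	      "read-mostly and write-hot fields share a cache line");

/* Fields written at the same time should stay in the same line. */
static_assert(offsetof(struct example_stats, events) / EXAMPLE_CACHE_LINE ==
	      offsetof(struct example_stats, bytes) / EXAMPLE_CACHE_LINE,
	      "write-hot fields were split across cache lines");
```

Compile-time checks like these turn the layout comment into something
the build verifies, so a later field insertion cannot silently move
the two groups back together.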

Note that sometimes, even after severe false sharing is detected and
solved, performance may still show no obvious improvement because the
hotspot moves to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes how data members share
cache lines.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/