.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

False sharing is related to the cache mechanism that maintains the
coherence of a cache line held in multiple CPUs' caches: when logically
unrelated data happens to share one cache line, a write by one CPU
invalidates the copies cached by the others.

There are many real-world cases of performance regressions caused by
false sharing.

* Among the concurrent accesses to the data, there is at least one write
  operation: the write/write or write/read case.

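
To make that condition concrete, here is a minimal userspace sketch (not
from the kernel; the 64-byte cache line size is an assumption): two
logically unrelated members land in the same cache line, so every write
to one of them invalidates the cached copy that readers of the other
depend on.

```c
#include <stddef.h>

#define CACHE_LINE 64	/* typical line size; an assumption, not universal */

/* Two logically unrelated members packed into one struct. */
struct shared {
	long written_often;	/* updated by one CPU in a hot loop */
	long read_often;	/* only ever read by other CPUs */
};

/* Returns 1 when both members fall into the same cache line, i.e. every
 * write to 'written_often' also invalidates cached 'read_often'. */
static int same_cache_line(void)
{
	return offsetof(struct shared, written_often) / CACHE_LINE ==
	       offsetof(struct shared, read_often) / CACHE_LINE;
}
```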
Back when a platform had only one or a few CPUs, hot data members were
sometimes deliberately packed into one cache line to keep them cache hot
and save cacheline/TLB footprint, like a lock and the data protected by
it. On today's systems with many CPUs this can backfire: when the lock is
contended, only the lock owner can write to the data, while all the other
CPUs are busy spinning on the lock and bouncing its cache line.

* a lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line;
* small pieces of global data are grouped together in one cache line;
  some kernel subsystems have many such small global parameters;
* data members of a big data structure just happen to sit together in
  one cache line (a cache line is usually 64 bytes or bigger) without
  being noticed.

The 'Mitigation' section below provides real-world examples.

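
The first pattern and its usual fix can be sketched as follows (a hedged
userspace sketch with an `int` standing in for a spinlock; kernel code
would use the `____cacheline_aligned_in_smp` annotation for the same
effect, and 64 bytes is an assumed line size):

```c
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size */

/* Problematic layout: CPUs spinning on 'lock' keep pulling in the line
 * that also holds 'data', which only the current lock owner writes. */
struct packed_guard {
	int lock;	/* stand-in for a spinlock */
	long data;	/* protected by 'lock' */
};

/* Mitigated layout: lock and data each get a dedicated cache line. */
struct padded_guard {
	_Alignas(CACHE_LINE) int lock;
	_Alignas(CACHE_LINE) long data;
};
```

The padded variant trades memory (two full cache lines) for the absence
of false sharing between waiters and the lock owner.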
Once hotspots are detected, tools like 'perf-c2c' and 'pahole' can help
pinpoint the offending data members inside the data structures.
'addr2line' is also good at decoding instruction pointers into source
locations.

perf-c2c can capture the cache lines with the most false sharing hits,
the functions accessing those cache lines, and the in-line offset of the
data. Simple commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
the report pinpoints the hottest falsely shared cache lines and the call
sites touching them.

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts, delimited at cache line
granularity. Users can match the offsets in perf-c2c's output against
pahole's decoding to locate the exact data members. For global data,
users can look up the data address in System.map.

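
That offset matching can also be scripted; here is a hedged sketch with a
hypothetical struct (not a real kernel structure), where `offsetof` stands
in for the offsets pahole prints:

```c
#include <stddef.h>

/* Hypothetical structure under investigation. */
struct sample {
	long a;		/* bytes 0..7  */
	long b;		/* bytes 8..15 */
	char name[48];	/* bytes 16..63 */
	long counter;	/* bytes 64..71 */
};

/* Map a byte offset reported by perf-c2c to the member it falls into,
 * the same lookup one does by eye against pahole's printed offsets. */
static const char *member_at(size_t off)
{
	if (off < offsetof(struct sample, b))
		return "a";
	if (off < offsetof(struct sample, name))
		return "b";
	if (off < offsetof(struct sample, counter))
		return "name";
	if (off < sizeof(struct sample))
		return "counter";
	return "?";
}
```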
It is usually unnecessary to hyper-optimize every rarely used data
structure or cold data path.

* Separate hot global data into its own dedicated cache line, even if it
  is just a few bytes; the downside is more consumption of memory, cache
  lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

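A userspace approximation of such a dedicated cache line, using C11
`_Alignas` where kernel code would use `__cacheline_aligned_in_smp`
(64-byte line assumed):

```c
#include <stdint.h>

#define CACHE_LINE 64	/* assumed cache line size */

/* A hot global on its own dedicated cache line: the alignment keeps any
 * preceding object off its line, and the aligned neighbor declared after
 * it cannot share (and therefore bounce) the line either. */
static _Alignas(CACHE_LINE) long hot_counter;
static _Alignas(CACHE_LINE) long unrelated_neighbor;

static int on_dedicated_line(void)
{
	uintptr_t a = (uintptr_t)&hot_counter;
	uintptr_t b = (uintptr_t)&unrelated_neighbor;

	return (a % CACHE_LINE == 0) && (a / CACHE_LINE != b / CACHE_LINE);
}
```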
* Reorganize the data structure, separating the interfering members into
  different cache lines.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

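The re-layout idea can be sketched like this (hypothetical fields, only
loosely inspired by the page_counter change above; 64-byte line assumed):

```c
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size */

/* Before: a frequently written counter interleaved with read-mostly
 * fields, so readers of 'limit' pay for every write to 'usage'. */
struct counter_before {
	long limit;	/* read-mostly */
	long usage;	/* written on every charge */
	long watermark;	/* read-mostly */
};

/* After: read-mostly fields grouped together, the written field moved
 * onto its own cache line. */
struct counter_after {
	long limit;
	long watermark;
	_Alignas(CACHE_LINE) long usage;
};
```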
* Replace 'write' with 'read' when possible, especially in loops. For
  some global variable, use compare(read)-then-write instead of an
  unconditional write. For example, use::

	if (!test_bit(XXX))
		set_bit(XXX);

  instead of directly calling "set_bit(XXX);"; similarly for atomic_t
  data::

	if (atomic_read(&XXX) == 0)
		atomic_set(&XXX, 1);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

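A runnable userspace version of the compare-then-write pattern, using C11
atomics as stand-ins for the kernel primitives (the flag and counter are
hypothetical, and the sketch is single-threaded):

```c
#include <stdatomic.h>

static atomic_long dirty_flag;	/* hot shared flag, hypothetical */
static long writes_done;	/* instrumentation for this sketch only */

/* Compare(read)-then-write: callers that find the flag already set only
 * perform a read, which leaves the cache line in the Shared state; the
 * invalidating write happens at most once instead of on every call. */
static void mark_dirty(void)
{
	if (atomic_load_explicit(&dirty_flag, memory_order_relaxed) != 1) {
		atomic_store_explicit(&dirty_flag, 1, memory_order_relaxed);
		writes_done++;
	}
}
```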
* Turn hot global data into 'per-cpu data + global data' when possible,
  or reasonably increase the threshold for syncing per-cpu data to the
  global data, to reduce or postpone the 'write' to that global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

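A single-threaded sketch of the per-cpu + batching idea (the NCPUS and
BATCH values are arbitrary assumptions; real kernel code would use
percpu_counter, per-cpu allocations and atomic updates):

```c
#define NCPUS 4		/* arbitrary CPU count for the sketch */
#define BATCH 32	/* sync threshold, an assumed tunable */

static long global_count;		/* the contended global cache line */
static long percpu_count[NCPUS];	/* per-CPU deltas; a real per-cpu
					 * allocation would not share lines */

/* Accumulate locally and only write the global once per BATCH events,
 * like a percpu_counter with a raised batch size. */
static void count_event(int cpu)
{
	if (++percpu_count[cpu] >= BATCH) {
		global_count += percpu_count[cpu];	/* atomic in SMP code */
		percpu_count[cpu] = 0;
	}
}

/* An accurate read folds in the not-yet-synced per-CPU deltas. */
static long count_read(void)
{
	long sum = global_count;

	for (int i = 0; i < NCPUS; i++)
		sum += percpu_count[i];
	return sum;
}
```

The larger BATCH is, the rarer the writes to the shared global line, at
the cost of a staler cheap read of `global_count` alone.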
* Group mostly read-only fields together.

One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes the cache
line sharing of data members.

.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/