[/
          Copyright Oliver Kowalke 2017.
    Distributed under the Boost Software License, Version 1.0.
    (See accompanying file LICENSE_1_0.txt or copy at
          http://www.boost.org/LICENSE_1_0.txt
]

[#tuning]
[section:tuning Tuning]

[heading Disable synchronization]

With [link cross_thread_sync `BOOST_FIBERS_NO_ATOMICS`] defined at the
compiler's command line, synchronization between fibers (in different
threads) is disabled. This is acceptable if the application is single threaded
and/or fibers are not synchronized between threads.


[heading Memory allocation]

The memory allocation algorithm is significant for performance in a
multithreaded environment, especially for __boost_fiber__ where fiber stacks
are allocated on the heap. The default user-level memory allocator (UMA) of
glibc is ptmalloc2, but it can be replaced by another UMA that fits the
concrete work-load better. For instance, Google's
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] achieves
better performance in the ['skynet] microbenchmark than glibc's default
memory allocator.


[heading Scheduling strategies]

The fibers in a thread are coordinated by a fiber manager. Fibers trade
control cooperatively, rather than preemptively.
Depending on the work-load, several strategies of scheduling the fibers are
possible [footnote 1024cores.net:
[@http://www.1024cores.net/home/scalable-architecture/task-scheduling-strategies Task Scheduling Strategies]]
that can be implemented on behalf of __algo__.

* work-stealing: ready fibers are held in a local queue; when the
  fiber-scheduler's local queue runs out of ready fibers, it randomly
  selects another fiber-scheduler and tries to steal a ready fiber from the
  victim (implemented in __work_stealing__ and __numa_work_stealing__)

* work-requesting: ready fibers are held in a local queue; when the
  fiber-scheduler's local queue runs out of ready fibers, it randomly
  selects another fiber-scheduler and requests a ready fiber; the victim
  fiber-scheduler sends a ready fiber back

* work-sharing: ready fibers are held in a global queue; fiber-schedulers
  concurrently push and pop ready fibers to/from the global queue
  (implemented in __shared_work__)

* work-distribution: fibers that became ready are proactively distributed to
  idle fiber-schedulers or fiber-schedulers with low load

* work-balancing: a dedicated (helper) fiber-scheduler periodically collects
  information about all fiber-schedulers running in other threads and
  re-distributes ready fibers among them


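For example, the work-stealing strategy is selected by installing
__work_stealing__ on every participating thread. A minimal sketch, assuming
Boost.Fiber is available; the thread count of `4` is arbitrary and every
thread must register with the same total count:

```cpp
#include <cstdint>
#include <thread>
#include <vector>
#include <boost/fiber/all.hpp>
#include <boost/fiber/algo/work_stealing.hpp>

// Each thread taking part in stealing installs the scheduler with the total
// thread count; an idle scheduler then steals ready fibers from a randomly
// selected victim when its own local queue runs dry.
static void worker( std::uint32_t thread_count) {
    boost::fibers::use_scheduling_algorithm<
        boost::fibers::algo::work_stealing >( thread_count);
    // ... run fibers in this thread
}

int main() {
    std::uint32_t n = 4; // total number of participating threads
    boost::fibers::use_scheduling_algorithm<
        boost::fibers::algo::work_stealing >( n);
    std::vector< std::thread > threads;
    for ( std::uint32_t i = 1; i < n; ++i) {
        threads.emplace_back( worker, n);
    }
    // ... launch fibers here; ready fibers may now be stolen
    for ( auto & t : threads) {
        t.join();
    }
    return 0;
}
```

Note that all `n` threads must call `use_scheduling_algorithm<>` before
fibers are scheduled, since the schedulers need to know about each other in
order to pick a victim.
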
[heading TTAS locks]

Boost.Fiber internally uses spinlocks to protect critical regions if fibers
running on different threads interact.
Spinlocks are implemented as TTAS (test-test-and-set) locks, i.e. the spinlock
tests the lock before calling an atomic exchange. This strategy helps to reduce
the cache-line invalidations triggered by acquiring/releasing the lock.


[heading Spin-wait loop]

A lock is considered under contention if a thread repeatedly fails to acquire
the lock because some other thread was faster.
Waiting for a short time lets other threads finish before trying to enter the
critical section again. While busy waiting on the lock, relaxing the CPU (via
the pause/yield mnemonic) gives the CPU a hint that the code is in a spin-wait
loop.

* prevents expensive pipeline flushes (speculatively executed load and compare
  instructions are not pushed to the pipeline)
* another hardware thread (simultaneous multithreading) can get a time slice
* it does delay a few CPU cycles, but this is necessary to prevent starvation

It is obvious that this strategy is useless on single-core systems because the
lock can only be released if the thread gives up its time slice in order to let
other threads run. The macro BOOST_FIBERS_SPIN_SINGLE_CORE replaces the CPU
hints (pause/yield mnemonic) by informing the operating system
(via `std::this_thread::yield()`) that the thread gives up its time slice so
that the operating system can switch to another thread.


[heading Exponential back-off]

The macro BOOST_FIBERS_RETRY_THRESHOLD determines how many times the CPU
iterates in the spin-wait loop before yielding the thread or blocking in
futex-wait.
The spinlock tracks how many times the thread failed to acquire the lock.
The higher the contention, the longer the thread should back off.
A ["Binary Exponential Backoff] algorithm together with a randomized
contention window is utilized for this purpose.
BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD determines the upper limit of the
contention window (expressed as the exponent for the basis of two).


[heading Speculative execution (hardware transactional memory)]

Boost.Fiber uses spinlocks to protect critical regions that can be used
together with transactional memory (see section [link speculation Speculative
execution]).

[note TSX is enabled if property `htm=tsx` is specified at the b2 command line
and `BOOST_USE_TSX` is applied to the compiler.]

[note A TSX transaction will be aborted if the floating-point state is modified
inside a critical region. As a consequence, floating-point operations, e.g.
store/load of floating-point related registers during a fiber (context) switch,
are disabled.]


[heading NUMA systems]

Modern multi-socket systems are usually designed as [link numa NUMA systems].
A suitable fiber scheduler like __numa_work_stealing__ reduces
remote memory access (latency).


[heading Parameters]

[table Parameters that might be defined at the compiler's command line
    [
        [Parameter]
        [Default value]
        [Effect on Boost.Fiber]
    ]
    [
        [BOOST_FIBERS_NO_ATOMICS]
        [-]
        [no multithreading support, all atomics removed, no synchronization
        between fibers running in different threads]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_STD_MUTEX]
        [-]
        [`std::mutex` used inside spinlock]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS]
        [+]
        [spinlock with test-test-and-set on shared variable]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE]
        [-]
        [spinlock with test-test-and-set on shared variable, adaptive retries
        while busy waiting]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX]
        [-]
        [spinlock with test-test-and-set on shared variable, suspend on
        futex after certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX]
        [-]
        [spinlock with test-test-and-set on shared variable, adaptive retries
        while busy waiting, suspend on futex after certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set and speculative execution (Intel TSX
        required)]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set on shared variable, adaptive retries
        while busy waiting and speculative execution (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set on shared variable, suspend on
        futex after certain number of retries and speculative execution
        (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set on shared variable, adaptive retries
        while busy waiting, suspend on futex after certain number of retries
        and speculative execution (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPIN_SINGLE_CORE]
        [-]
        [on single-core machines with multiple threads, yield thread
        (`std::this_thread::yield()`) after collisions]
    ]
    [
        [BOOST_FIBERS_RETRY_THRESHOLD]
        [64]
        [max number of retries while busy spinning, then use fallback]
    ]
    [
        [BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD]
        [16]
        [max size of the contention window, expressed as exponent for the
        basis of two]
    ]
    [
        [BOOST_FIBERS_SPIN_BEFORE_SLEEP0]
        [32]
        [max number of retries that relax the processor before the thread
        sleeps for 0s]
    ]
    [
        [BOOST_FIBERS_SPIN_BEFORE_YIELD]
        [64]
        [max number of retries where the thread sleeps for 0s before yielding
        the thread (`std::this_thread::yield()`)]
    ]
]

[endsect]