[/
      Copyright Oliver Kowalke 2017.
      Distributed under the Boost Software License, Version 1.0.
      (See accompanying file LICENSE_1_0.txt or copy at
      http://www.boost.org/LICENSE_1_0.txt
]

[#numa]
[section:numa NUMA]

Modern micro-processors contain integrated memory controllers that are connected
via channels to the memory. Memory access is organized in one of two ways:[br]
Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).

In contrast to UMA, which provides a centralized pool of memory (and thus does
not scale beyond a certain number of processors), a NUMA architecture divides the
memory into local and remote memory relative to the micro-processor.[br]
Local memory is directly attached to the processor's integrated memory controller.
Memory connected to the memory controller of another micro-processor (multi-socket
systems) is considered remote memory. If a memory controller accesses remote memory,
it has to traverse the interconnect[footnote On x86 the interconnect is implemented
by Intel's QuickPath Interconnect (QPI) and AMD's HyperTransport.] and
connect to the remote memory controller.[br]
Accessing remote memory thus adds latency overhead compared to local memory access.
Because of the different memory locations, a NUMA system experiences ['non-uniform]
memory access time.[br]
As a consequence, the best performance is achieved by keeping the memory access
local.

[$../../../../libs/fiber/doc/NUMA.png [align center]]


[heading NUMA support in Boost.Fiber]

Because operating systems expose only a subset of the NUMA functionality,
Boost.Fiber provides only a minimalistic NUMA API.

[important In order to enable NUMA support, the b2 property `numa=on` must be
specified and the additional library `libboost_fiber_numa.so` must be linked.]
[important MinGW using the pthread implementation is not supported on Windows.]

[table Supported functionality/operating systems
    [
        []
        [AIX]
        [FreeBSD]
        [HP/UX]
        [Linux]
        [Solaris]
        [Windows]
    ]
    [
        [pin thread]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+]
    ]
    [
        [logical CPUs/NUMA nodes]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+[footnote Windows organizes logical cpus in groups of 64; boost.fiber
        maps {group-id, cpu-id} to a scalar equivalent to the cpu ID of Linux
        (64 * group ID + cpu ID).]]
    ]
    [
        [NUMA node distance]
        [-]
        [-]
        [-]
        [+]
        [-]
        [-]
    ]
    [
        [tested on]
        [AIX 7.2]
        [FreeBSD 11]
        [-]
        [Arch Linux (4.10.13)]
        [OpenIndiana HIPSTER]
        [Windows 10]
    ]
]

In order to keep the memory access as local as possible, the NUMA topology must
be evaluated.

    std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
    for ( auto n : topo) {
        std::cout << "node: " << n.id << " | ";
        std::cout << "cpus: ";
        for ( auto cpu_id : n.logical_cpus) {
            std::cout << cpu_id << " ";
        }
        std::cout << "| distance: ";
        for ( auto d : n.distance) {
            std::cout << d << " ";
        }
        std::cout << std::endl;
    }
    std::cout << "done" << std::endl;

    output:
        node: 0 | cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 | distance: 10 21
        node: 1 | cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 | distance: 21 10
        done

The example shows that the system consists of two NUMA-nodes, with 16 logical
cpus belonging to each NUMA-node. The distance measures the cost of accessing
the memory of another NUMA-node. A NUMA-node always has a distance of `10` to
itself (the lowest possible value).[br]
The position in the `distance` array corresponds to the NUMA-node ID.
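The `distance` array can be used, for instance, to select the remote NUMA-node
with the cheapest memory access. The following is a minimal sketch; the helper
`closest_remote_node()` is hypothetical (not part of Boost.Fiber) and assumes,
as documented below, that NUMA-node IDs index the `distance` array:

    #include <cstdint>
    #include <limits>
    #include <vector>

    #include <boost/fiber/numa/topology.hpp>

    // hypothetical helper: returns the ID of the remote NUMA-node with the
    // smallest distance (cheapest memory access) relative to `node_id`
    std::uint32_t closest_remote_node( std::uint32_t node_id,
            std::vector< boost::fibers::numa::node > const& topo) {
        std::uint32_t best_id = node_id;
        std::uint32_t best_distance = std::numeric_limits< std::uint32_t >::max();
        for ( auto const& local : topo) {
            if ( local.id != node_id) {
                continue;
            }
            // local.distance[i] is the cost of accessing NUMA-node `i`
            for ( auto const& remote : topo) {
                if ( remote.id != node_id && local.distance[remote.id] < best_distance) {
                    best_distance = local.distance[remote.id];
                    best_id = remote.id;
                }
            }
        }
        return best_id;
    }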
Some work-loads benefit from pinning threads to a logical cpu. For instance, the
scheduling algorithm __numa_work_stealing__ pins the thread that runs the fiber
scheduler to a logical cpu. This prevents the operating system scheduler from
moving the thread to another logical cpu that might run other fiber scheduler(s),
and from migrating the thread to a logical cpu belonging to another NUMA-node.

    void thread( std::uint32_t cpu_id, std::uint32_t node_id, std::vector< boost::fibers::numa::node > const& topo) {
        // thread registers itself at work-stealing scheduler
        boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >( cpu_id, node_id, topo);
        ...
    }

    // evaluate the NUMA topology
    std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
    // start-thread runs on NUMA-node `0`
    auto node = topo[0];
    // start-thread is pinned to the first cpu ID in the list of logical cpus of NUMA-node `0`
    auto start_cpu_id = * node.logical_cpus.begin();
    // start worker-threads first
    std::vector< std::thread > threads;
    for ( auto & node : topo) {
        for ( std::uint32_t cpu_id : node.logical_cpus) {
            // exclude start-thread
            if ( start_cpu_id != cpu_id) {
                // spawn thread
                threads.emplace_back( thread, cpu_id, node.id, std::cref( topo) );
            }
        }
    }
    // start-thread registers itself on work-stealing scheduler
    boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >( start_cpu_id, node.id, topo);
    ...

The example evaluates the NUMA topology with `boost::fibers::numa::topology()`
and spawns a thread for each logical cpu. Each spawned thread installs the
NUMA-aware work-stealing scheduler. The scheduler pins the thread to the
logical cpu that was specified at construction.[br]
If the local queue of one thread runs out of ready fibers, the thread tries to
steal a ready fiber from another thread running on a logical cpu that belongs
to the same NUMA-node (local memory access). If no fiber could be stolen, the
thread tries to steal fibers from logical cpus belonging to other NUMA-nodes
(remote memory access).
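The elided parts (`...`) are where the actual fiber workload runs. A minimal
sketch of such a workload, assuming nothing beyond the public fiber API
(`worker_count` and the lambda body are illustrative only):

    #include <cstddef>
    #include <vector>

    #include <boost/fiber/all.hpp>

    void run_fibers( std::size_t worker_count) {
        std::vector< boost::fibers::fiber > fibers;
        for ( std::size_t i = 0; i < worker_count; ++i) {
            fibers.emplace_back( []{
                // do some work; a yielding fiber becomes ready again and
                // may be stolen by an idle scheduler on another logical cpu
                boost::this_fiber::yield();
            });
        }
        // join all fibers before the hosting thread terminates
        for ( auto & f : fibers) {
            f.join();
        }
    }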
[heading Synopsis]

    #include <boost/fiber/numa/pin_thread.hpp>
    #include <boost/fiber/numa/topology.hpp>

    namespace boost {
    namespace fibers {
    namespace numa {

    struct node {
        std::uint32_t id;
        std::set< std::uint32_t > logical_cpus;
        std::vector< std::uint32_t > distance;
    };
    bool operator<( node const&, node const&) noexcept;

    std::vector< node > topology();

    void pin_thread( std::uint32_t);
    void pin_thread( std::uint32_t, std::thread::native_handle_type);

    }}}

    #include <boost/fiber/numa/algo/work_stealing.hpp>

    namespace boost {
    namespace fibers {
    namespace numa {
    namespace algo {

    class work_stealing;

    }}}}


[ns_class_heading numa..node]

    #include <boost/fiber/numa/topology.hpp>

    namespace boost {
    namespace fibers {
    namespace numa {

    struct node {
        std::uint32_t id;
        std::set< std::uint32_t > logical_cpus;
        std::vector< std::uint32_t > distance;
    };
    bool operator<( node const&, node const&) noexcept;

    }}}

[ns_data_member_heading numa..node..id]

    std::uint32_t id;

[variablelist
[[Effects:] [ID of the NUMA-node]]
]

[ns_data_member_heading numa..node..logical_cpus]

    std::set< std::uint32_t > logical_cpus;

[variablelist
[[Effects:] [set of logical cpu IDs belonging to the NUMA-node]]
]

[ns_data_member_heading numa..node..distance]

    std::vector< std::uint32_t > distance;

[variablelist
[[Effects:] [The distance between NUMA-nodes describes the cost of accessing
the remote memory.]]
[[Note:] [A NUMA-node has a distance of `10` to itself; remote NUMA-nodes
have a distance > `10`. The index in the array corresponds to the ID `id`
of the NUMA-node. At the moment only Linux returns the correct distances;
on all other operating systems remote NUMA-nodes get a default value of
`20`.]]
]

[ns_operator_heading numa..node..operator_less..operator<]

    bool operator<( node const& lhs, node const& rhs) noexcept;

[variablelist
[[Returns:] [`true` if `lhs != rhs` is true and the
implementation-defined total order of `node::id` values places `lhs` before
`rhs`, `false` otherwise.]]
[[Throws:] [Nothing.]]
]


[ns_function_heading numa..topology]

    #include <boost/fiber/numa/topology.hpp>

    namespace boost {
    namespace fibers {
    namespace numa {

    std::vector< node > topology();

    }}}

[variablelist
[[Effects:] [Evaluates the NUMA topology.]]
[[Returns:] [a vector of NUMA-nodes describing the NUMA architecture of the
system (each element represents a NUMA-node).]]
[[Throws:] [`system_error`]]
]
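`topology()` reports failure by throwing `system_error`; a call can be guarded
as in the following minimal sketch (assuming `system_error` refers to
`std::system_error`):

    #include <iostream>
    #include <system_error>
    #include <vector>

    #include <boost/fiber/numa/topology.hpp>

    int main() {
        try {
            // evaluate the NUMA topology of the system
            std::vector< boost::fibers::numa::node > topo =
                boost::fibers::numa::topology();
            std::cout << topo.size() << " NUMA-nodes found" << std::endl;
        } catch ( std::system_error const& e) {
            // evaluating the topology failed; report and bail out
            std::cerr << "topology() failed: " << e.what() << std::endl;
            return 1;
        }
        return 0;
    }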
[ns_function_heading numa..pin_thread]

    #include <boost/fiber/numa/pin_thread.hpp>

    namespace boost {
    namespace fibers {
    namespace numa {

    void pin_thread( std::uint32_t cpu_id);
    void pin_thread( std::uint32_t cpu_id, std::thread::native_handle_type h);

    }}}

[variablelist
[[Effects:] [The first version pins the calling thread to the logical cpu with
ID `cpu_id`, i.e. the operating system scheduler will not migrate the thread to
another logical cpu. The second version pins the thread with the native ID `h`
to the logical cpu with ID `cpu_id`.]]
[[Throws:] [`system_error`]]
]


[ns_class_heading numa..work_stealing]

This class implements __algo__; the thread running this scheduler is pinned to the given
logical cpu. If the local ready-queue runs out of ready fibers, ready fibers are stolen
from other schedulers that run on logical cpus belonging to the same NUMA-node (local
memory access).[br]
If no ready fibers can be stolen from the local NUMA-node, the algorithm selects
schedulers running on other NUMA-nodes (remote memory access).[br]
The victim scheduler (from which a ready fiber is stolen) is selected at random.

    #include <boost/fiber/numa/algo/work_stealing.hpp>

    namespace boost {
    namespace fibers {
    namespace numa {
    namespace algo {

    class work_stealing : public algorithm {
    public:
        work_stealing( std::uint32_t cpu_id,
                       std::uint32_t node_id,
                       std::vector< boost::fibers::numa::node > const& topo,
                       bool suspend = false);

        work_stealing( work_stealing const&) = delete;
        work_stealing( work_stealing &&) = delete;

        work_stealing & operator=( work_stealing const&) = delete;
        work_stealing & operator=( work_stealing &&) = delete;

        virtual void awakened( context *) noexcept;

        virtual context * pick_next() noexcept;

        virtual bool has_ready_fibers() const noexcept;

        virtual void suspend_until( std::chrono::steady_clock::time_point const&) noexcept;

        virtual void notify() noexcept;
    };

    }}}}

[heading Constructor]

    work_stealing( std::uint32_t cpu_id, std::uint32_t node_id,
                   std::vector< boost::fibers::numa::node > const& topo,
                   bool suspend = false);

[variablelist
[[Effects:] [Constructs the work-stealing scheduling algorithm. The thread is
pinned to the logical cpu with ID `cpu_id`. If the local ready-queue runs out
of ready fibers, ready fibers are stolen from other schedulers, using `topo`
(which represents the NUMA topology of the system).]]
[[Throws:] [`system_error`]]
[[Note:][If `suspend` is set to `true`, the scheduler suspends if no ready
fiber could be stolen. The scheduler will be woken up if a sleeping fiber times
out or if it is notified from remote (by another thread or fiber scheduler).]]
]
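If a thread should block instead of busy-waiting while no ready fiber is
available, the scheduler can be installed with `suspend` enabled. A minimal
sketch, reusing `topo` from the topology example above:

    // pin this thread to logical cpu `0` of NUMA-node `0` and let the
    // scheduler suspend (block) when no ready fiber could be stolen;
    // it is woken up again via suspend_until()/notify()
    boost::fibers::use_scheduling_algorithm<
        boost::fibers::numa::algo::work_stealing >(
            0 /* cpu_id */, 0 /* node_id */, topo, true /* suspend */);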
[ns_member_heading numa..work_stealing..awakened]

    virtual void awakened( context * f) noexcept;

[variablelist
[[Effects:] [Enqueues fiber `f` onto the shared ready queue.]]
[[Throws:] [Nothing.]]
]

[ns_member_heading numa..work_stealing..pick_next]

    virtual context * pick_next() noexcept;

[variablelist
[[Returns:] [the fiber at the head of the ready queue, or `nullptr` if the
queue is empty.]]
[[Throws:] [Nothing.]]
[[Note:] [Placing ready fibers onto the tail of the shared queue, and returning
them from the head of that queue, shares the thread between ready fibers in
round-robin fashion.]]
]

[ns_member_heading numa..work_stealing..has_ready_fibers]

    virtual bool has_ready_fibers() const noexcept;

[variablelist
[[Returns:] [`true` if the scheduler has fibers ready to run.]]
[[Throws:] [Nothing.]]
]

[ns_member_heading numa..work_stealing..suspend_until]

    virtual void suspend_until( std::chrono::steady_clock::time_point const& abs_time) noexcept;

[variablelist
[[Effects:] [Informs `work_stealing` that no ready fiber will be available until
time-point `abs_time`. This implementation blocks in
[@http://en.cppreference.com/w/cpp/thread/condition_variable/wait_until
`std::condition_variable::wait_until()`].]]
[[Throws:] [Nothing.]]
]

[ns_member_heading numa..work_stealing..notify]

    virtual void notify() noexcept;

[variablelist
[[Effects:] [Wakes up a pending call to [member_link
work_stealing..suspend_until]; some fibers might be ready. This implementation
wakes `suspend_until()` via
[@http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all
`std::condition_variable::notify_all()`].]]
[[Throws:] [Nothing.]]
]

[endsect]