1[/=========================================================================== 2 Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters 3 4 5 Distributed under the Boost Software License, Version 1.0 6 See accompanying file LICENSE_1_0.txt or copy at 7 http://www.boost.org/LICENSE_1_0.txt 8=============================================================================/] 9 10[section:block_indirect_sort 3.1- block_indirect_sort] 11 12 13[section:block_description Description] 14[: 15 16[*BLOCK_INDIRECT_SORT] is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library. 17 18The most important characteristics of this algorithm are the *speed* and the *low memory consumption*. 19 20[*[teletype] 21`` 22 | | | | 23 Algorithm |Stable | Additional memory |Best, average, and worst case | 24 ----------------------+-------+------------------------+------------------------------+ 25 block_indirect_sort | no |block_size * num_threads| N, N LogN, N LogN | 26 | | | | 27`` 28] 29 30The block_size is an internal parameter of the algorithm, which in order to achieve the 31highest speed, changes according with the size of the objects to sort according to the next table.The strings use a block_size of 128. 32 33 34[*[teletype] 35`` 36 | | | | | | | | 37object size (bytes) | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 - | 38--------------------------------+--------+---------+---------+---------+---------+---------+----------+ 39block_size (number of elements) | 4096 | 2048 | 1024 | 768 | 512 | 256 | 128 | 40 | | | | | | | | 41`` 42] 43] 44[endsect] 45[br] 46[section:block_benchmark Benchmark] 47[: 48Sorting 100 000 000 64 bits numbers, the measured memory used was: 49[*[teletype] 50`` 51 | | | 52 Algorithm | Time (secs) | Memory used | 53 ----------------------------------+-------------+-------------+ 54 Open MP Parallel Sort | 1.1990 | 1564 MB | 55 Threading Building Blocks (TBB) | 1.6411 | 789 MB | 56 Block Indirect Sort | 0.9270 | 790 MB | 57 | | | 58`` 59] 60] 61[endsect] 62[br] 63 64[section:block_programming Programming] 65[: 66[h4[_Thread specification]] 67[: 68This algorithm has an integer parameter indicating the *number of threads* to use in the sorting process, 69which always is the last value in the call. The default value (if left unspecified) is the number of HW threads of 70the machine where the program is running provided by std::thread::hardware_concurrency(). 71 72If the number is 1 or 0, the algorithm runs with only 1 thread. 73 74The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater 75than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads, 76and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ). 77If this value is 0, the program is executed with 1 thread. 78] 79[h4[_Programming]] 80[: 81You only need to include the file boost/sort/sort.hpp to use this algorithm 82 83The algorithm runs in the namespace boost::sort 84 85[c++] 86`` 87 #include <boost/sort/sort.hpp> 88 89 90 91 template <class iter_t> 92 void block_indirect_sort (iter_t first, iter_t last); 93 94 template <class iter_t, typename compare> 95 void block_indirect_sort (iter_t first, iter_t last, compare comp); 96 97 template <class iter_t> 98 void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread); 99 100 template <class iter_t, typename compare> 101 void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread); 102`` 103 104This algorithm needs a *C++11 compliant compiler*. You don't need any other code or library. With older compilers correct operation is not guaranteed. 105 106If the number of threads is unspecified, use the result of std::thread::hardware_concurrency() 107 108This algorithm uses a *comparison object*, in the same way as the standard library sort 109algorithms. If not defined, the comparison object is std::less, which uses 110the < operator internally. 111 112This algorithm is [*exception safe], meaning that, the exceptions generated by the algorithm 113guarantee the integrity of the objects to sort, but not their relative order. If the exception 114is generated inside the objects (in the move or in the copy constructor.. ) the results can be 115unpredictable. 116] 117] 118[endsect] 119[br] 120[section:block_internal Internal Description] 121[: 122 123There are two primary categories of parallelization in sorting algorithms. 124 125 126[h4[_Subdivision Algorithms]] 127 128 129[: Filter the data and generate two or more parts. Each part obtained is 130filtered and divided by other threads. This filter and division process is done 131until the size of the part is smaller than a predefined size, then is sorted by a single thread. 132 133The algorithm most frequently used in the filter and sorting is quick sort 134 135These algorithms are fast with a small number of threads, but are inefficient 136with a great number of threads. Examples of this category are 137 138*Intel Threading Building Blocks (TBB) 139 140*Microsoft PPL Parallel Sort. 141 142] 143[h4[_Merging Algorithms]] 144 145[: 146 147Divide the data in parts, and each part is sorted by a thread. When 148the parts are sorted, they are merged to obtain the final results. These algorithms need additional memory for the 149merge, usually the same size as the data. 150 151With a small number of threads, these algorithms have similar speed to 152the subdivision algorithms, but with many threads are much faster. 153 154Examples of this category are 155 156*GCC Parallel Sort (based on OpenMP) 157 158*Microsoft PPL Parallel Buffered Sort 159 160This generates an *undesirable duality*. With a small number of threads the optimal algorithm is not the optimal for a big number of threads. 161 162For this reason, the SW designed for a *small machine* is *inadequate* for a *big machine* and vice versa. 163 164Using only *merging algorithms*, has the *problem of the additional memory* used, usually of the same size as the data. 165] 166 167[h4[_New Parallel Sort Algorithm (Block Indirect Sort)]] 168[: 169 170This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm. 171 172*With small number of threads, it is a subdivision algorithm. 173 174*With many threads it is a merging algorithm, with a small auxiliary memory ( block_size * number of threads). 175 176This algorithm *eliminates the duality*. The same code has *optimal performance* with a small and a big number of threads. 177 178The number of threads to use is evaluated in each execution. 179When the program runs with a *small number of threads* the algorithm 180internally uses a *subdivision algorithm* and has similar performance to TBB, and when run with *many threads*, 181it internally uses the *new algorithm* and has the performance of GCC Parallel Sort, with the additional advantage of *reduced memory consumption*. 182 183] 184] 185[endsect] 186[br] 187 188[section:design_process Design Process] 189 190[: 191[h4[_Initial Idea]] 192[: 193The *initial idea*, was to build a *merge algorithm*, to be *fast with many threads, with a low additional memory*. 194 195This algortihm is *not based in any other idea or algorithm*. The technique used in the algorithm (indirect blocks) *is new, and had been conceived and implemented for this algorithm*. 196 197As sample of their results, we can see the the sorting 100 000 000 64 bits numbers, ramdomly generated, 198in a machine with 12 threads. 199 200[*[teletype] 201`` 202 | | | 203 Algorithm | Time (secs) | Memory used | 204 ----------------------------------+-------------+-------------+ 205 Open MP Parallel Sort | 1.1990 | 1564 MB | 206 Threading Building Blocks (TBB) | 1.6411 | 789 MB | 207 Block Indirect Sort | 0.9270 | 790 MB | 208 | | | 209`` 210] 211 212 213The best words about this algorithm are expressed by their [*[link sort.parallel.linux_parallel Benchmarks results]] 214 215] 216[h4[_Design process]] 217[: 218The process had been *long and very hard*, mainly, by the uncertainty about if the ideas are correct and run 219so fast as need for to be useful. This is complicated by the fact that we can’t be sure of the efficiency until the last part 220of the code is done and the first benchmark has run. 221 222But it had been a *very exciting process*, each time a problem is resolved, a new internal algorithm is designed, 223tested …, and see, step by step, the advance of the process. 224 225I discovered *many new problems* during this process, unknown until now, which forced me to design *new internal algorithms* to resolve them, 226and divide the work in many parts to execute in parallel mode. Due to this, you can find many nice algorithms inside the sorting algorithm 227written to resolve and parallelize the internal problems. 228 229 230If you are interested in a detailed description of the algorithm, you can find it here : [* [@../papers/block_indirect_sort_en.pdf block_indirect_sort_en.pdf]]. 231 232] 233 234] 235[endsect] 236[endsect] 237 238