• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1[/===========================================================================
2 Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
3
4
5 Distributed under the Boost Software License, Version 1.0
6 See accompanying file LICENSE_1_0.txt or copy at
7 http://www.boost.org/LICENSE_1_0.txt
8=============================================================================/]
9
10[section:block_indirect_sort 3.1- block_indirect_sort]
11
12
13[section:block_description Description]
14[:
15
16[*BLOCK_INDIRECT_SORT] is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library.
17
18The most important characteristics of this algorithm are the *speed* and the *low memory consumption*.
19
20[*[teletype]
21``
22                          |       |                        |                              |
23    Algorithm             |Stable |   Additional memory    |Best, average, and worst case |
24    ----------------------+-------+------------------------+------------------------------+
25    block_indirect_sort   |  no   |block_size * num_threads|     N, N LogN, N LogN        |
26                          |       |                        |                              |
27``
28]
29
30The block_size is an internal parameter of the algorithm, which in order to achieve the
31highest speed, changes according with the size of the objects to sort according to the next table.The strings use a block_size of 128.
32
33
34[*[teletype]
35``
36                                |        |         |         |         |         |         |          |
37object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
38--------------------------------+--------+---------+---------+---------+---------+---------+----------+
39block_size (number of elements) |  4096  |  2048   |   1024  |   768   |   512   |   256   |  128     |
40                                |        |         |         |         |         |         |          |
41``
42]
43]
44[endsect]
45[br]
46[section:block_benchmark Benchmark]
47[:
48Sorting 100 000 000  64 bits numbers, the measured memory used was:
49[*[teletype]
50``
51                                      |             |             |
52    Algorithm                         | Time (secs) | Memory used |
53    ----------------------------------+-------------+-------------+
54    Open MP Parallel Sort             |   1.1990    |  1564 MB    |
55    Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
56    Block Indirect Sort               |   0.9270    |   790 MB    |
57                                      |             |             |
58``
59]
60]
61[endsect]
62[br]
63
64[section:block_programming Programming]
65[:
66[h4[_Thread specification]]
67[:
68This algorithm has an integer parameter indicating the *number of threads* to use in the sorting process,
69which always is the last value in the call. The default value (if left unspecified) is the number of HW threads of
70the machine where the program is running provided by std::thread::hardware_concurrency().
71
72If the number is 1 or 0, the algorithm runs with only 1 thread.
73
74The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater
75than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads,
76and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ).
77If this value is 0, the program is executed with 1 thread.
78]
79[h4[_Programming]]
80[:
81You only need to include the file boost/sort/sort.hpp to use this algorithm
82
83The algorithm runs in the namespace boost::sort
84
85[c++]
86``
87    #include <boost/sort/sort.hpp>
88
89
90
91    template <class iter_t>
92    void block_indirect_sort (iter_t first, iter_t last);
93
94    template <class iter_t, typename compare>
95    void block_indirect_sort (iter_t first, iter_t last, compare comp);
96
97    template <class iter_t>
98    void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);
99
100    template <class iter_t, typename compare>
101    void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
102``
103
104This algorithm needs a *C++11 compliant compiler*. You don't need any other code or library. With older compilers correct operation is not guaranteed.
105
106If the number of threads is unspecified, use the result of std::thread::hardware_concurrency()
107
108This algorithm uses a *comparison object*, in the same way as the standard library sort
109algorithms. If not defined, the comparison object is std::less, which uses
110the < operator internally.
111
112This algorithm is [*exception safe],  meaning that, the exceptions generated by the algorithm
113guarantee the integrity of the objects to sort, but not their relative order. If the exception
114is generated inside the objects (in the move or in the copy constructor.. ) the results can be
115unpredictable.
116]
117]
118[endsect]
119[br]
120[section:block_internal Internal Description]
121[:
122
123There are two primary categories of parallelization in sorting algorithms.
124
125
126[h4[_Subdivision Algorithms]]
127
128
129[: Filter the data and generate two or more parts. Each part obtained is
130filtered and divided by other threads. This filter and division process is done
131until the size of the part is smaller than a predefined size, then is sorted by a single thread.
132
133The algorithm most frequently used in the filter and sorting is quick sort
134
135These algorithms are fast with a small number of threads, but are inefficient
136with a great number of threads. Examples of this category are
137
138*Intel Threading Building Blocks (TBB)
139
140*Microsoft PPL Parallel Sort.
141
142]
143[h4[_Merging Algorithms]]
144
145[:
146
147Divide the data in parts, and each part is sorted by a thread. When
148the parts are sorted, they are merged to obtain the final results. These algorithms need additional memory for the
149merge, usually the same size as the data.
150
151With a small number of threads, these algorithms have similar speed to
152the subdivision algorithms, but with many threads are much faster.
153
154Examples of this category are
155
156*GCC Parallel Sort (based on OpenMP)
157
158*Microsoft PPL Parallel Buffered Sort
159
160This generates an *undesirable duality*. With a small number of threads the optimal algorithm is not the optimal for a big number of threads.
161
162For this reason, the SW designed for a *small machine* is *inadequate* for a *big machine* and vice versa.
163
164Using only *merging algorithms*, has the *problem of the additional memory* used, usually of the same size as the data.
165]
166
167[h4[_New Parallel Sort Algorithm (Block Indirect Sort)]]
168[:
169
170This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm.
171
172*With small number of threads, it is a subdivision algorithm.
173
174*With many threads it is a merging algorithm, with a small auxiliary memory ( block_size * number of threads).
175
176This algorithm *eliminates the duality*. The same code has *optimal performance* with a small and a big number of threads.
177
178The number of threads to use is evaluated in each execution.
179When the program runs  with a *small number of threads* the algorithm
180internally uses a *subdivision algorithm* and has similar performance to  TBB, and when run with *many threads*,
181it internally uses the *new algorithm* and has the performance of GCC Parallel Sort, with the additional advantage of *reduced memory consumption*.
182
183]
184]
185[endsect]
186[br]
187
188[section:design_process Design Process]
189
190[:
191[h4[_Initial Idea]]
192[:
193The *initial idea*, was to build a *merge algorithm*, to be *fast with many threads, with a low additional memory*.
194
195This algortihm is *not based in any other idea or algorithm*. The technique used in the algorithm (indirect blocks) *is new, and had been conceived and implemented for this algorithm*.
196
197As sample of their results, we can see the the sorting 100 000 000  64 bits numbers, ramdomly generated,
198in a machine with 12 threads.
199
200[*[teletype]
201``
202                                      |             |             |
203    Algorithm                         | Time (secs) | Memory used |
204    ----------------------------------+-------------+-------------+
205    Open MP Parallel Sort             |   1.1990    |  1564 MB    |
206    Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
207    Block Indirect Sort               |   0.9270    |   790 MB    |
208                                      |             |             |
209``
210]
211
212
213The best words about this algorithm are expressed by their [*[link sort.parallel.linux_parallel Benchmarks results]]
214
215]
216[h4[_Design process]]
217[:
218The process had been *long and very hard*, mainly, by the uncertainty about if the ideas are correct and run
219so fast as need for to be useful.  This is complicated by the fact that we can’t be sure of the efficiency until the last part
220of the code is done and the first benchmark has run.
221
222But it had been a *very exciting process*, each time a problem is resolved, a new internal algorithm is designed,
223tested …,  and  see, step by step, the advance of the process.
224
225I discovered *many new problems* during this process, unknown until now, which forced me to design *new internal algorithms* to resolve them,
226and divide the work in many parts to execute in parallel mode. Due to this, you can find many nice algorithms inside the sorting algorithm
227written to resolve and parallelize the internal problems.
228
229
230If you are interested in a detailed description of the algorithm, you can find it here : [* [@../papers/block_indirect_sort_en.pdf block_indirect_sort_en.pdf]].
231
232]
233
234]
235[endsect]
236[endsect]
237
238