1[/=========================================================================== 2 Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters 3 4 5 Distributed under the Boost Software License, Version 1.0 6 See accompanying file LICENSE_1_0.txt or copy at 7 http://www.boost.org/LICENSE_1_0.txt 8=============================================================================/] 9 10[section:spreadsort 2.1.-Spreadsort] 11 12[/ Some composite templates] 13[template super[x]'''<superscript>'''[x]'''</superscript>'''] 14[template sub[x]'''<subscript>'''[x]'''</subscript>'''] 15[template floor[x]'''⌊'''[x]'''⌋'''] 16[template floorlr[x][lfloor][x][rfloor]] 17[template ceil[x] '''⌈'''[x]'''⌉'''] 18 19[/ Required for autoindexing] 20[import ../../../tools/auto_index/include/auto_index_helpers.qbk] 21[/ Must be first included file!] 22 23[/Files containing quickbook snippets] 24[import ../example/charstringsample.cpp] 25[import ../example/stringfunctorsample.cpp] 26[import ../example/reverseintsample.cpp] 27[import ../example/rightshiftsample.cpp] 28[import ../example/int64.cpp] 29[import ../example/floatfunctorsample.cpp] 30[import ../example/generalizedstruct.cpp] 31 32[import html4_symbols.qbk] [/ Provides various useful squiggles.] 33 34[def __spreadsort [@http://en.wikipedia.org/wiki/Spreadsort spreadsort]] 35[def __introsort [@http://en.wikipedia.org/wiki/Introsort introsort]] 36[def __stl_sort [@http://www.cplusplus.com/reference/algorithm/sort/ STL std::sort]] 37[def __big_o [@http://en.wikipedia.org/wiki/Big_O_notation Big O notation]] 38[def __radix_sort[@http://en.wikipedia.org/wiki/Radix_sort radix sort]] 39[def __adaptive_left_reflex [@http://www.nik.no/2002/Maus.pdf Arne Maus, Adaptive Left Reflex]] 40[def __american_flag [@http://en.wikipedia.org/wiki/American_flag_sort American flag sort]] 41[def __overloading [link sort.overview.overloading overloading]] 42[def __lookup [link sort.rationale.lookup lookup]] 43[def __random_access_iter [@http://en.cppreference.com/w/cpp/concept/RandomAccessIterator RandomAccessIterator]] 44[def __strictweakordering [@http://en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings strict weak ordering]] 45 46[/ Links to functions for use in text] 47[def __integer_sort [^[funcref boost::sort::spreadsort::integer_sort integer_sort]]] 48[def __float_sort [^[funcref boost::sort::spreadsort::float_sort float_sort]]] 49[def __string_sort [^[funcref boost::sort::spreadsort::string_sort string_sort]]] 50[def __spreadsort [^[funcref boost::sort::spreadsort::spreadsort spreadsort]]] [/Note diff from Wiki link __spreadsort] 51[def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]] 52 53[section:overview Overview] 54 55[section:intro Introduction] 56 57Spreadsort combines generic implementations of multiple high-speed sorting algorithms 58that outperform those in the C++ standard in both average and worst case performance 59when there are over 1000 elements in the list to sort. 60 61They fall back to __stl_sort on small data sets. 62 63[warning These algorithms all only work on 64[@http://www.cplusplus.com/reference/iterator/RandomAccessIterator/ random access iterators].] 65 66They are hybrids using both radix and comparison-based sorting, 67specialized to sorting common data types, such as integers, floats, and strings. 68 69These algorithms are encoded in a generic fashion and accept functors, 70enabling them to sort any object that can be processed like these basic data types. 71In the case of __string_sort, this includes anything 72with a defined strict-weak-ordering that __std_sort can sort, 73but writing efficient functors for some complex key types 74may not be worth the additional effort relative to just using __std_sort, 75depending on how important speed is to your application. 76Sample usages are available in the example directory. 77 78Unlike many radix-based algorithms, 79the underlying __spreadsort algorithm is designed around [*worst-case performance]. 80It performs better on chunky data (where it is not widely distributed), 81so that on real data it can perform substantially better than on random data. 82Conceptually, __spreadsort can sort any data for which an absolute ordering can be determined, 83and __string_sort is sufficiently flexible that this should be possible. 84 85Situations where __spreadsort is fastest relative to __std_sort: 86 87# Large number of elements to sort (['N] >= 10000). 88 89# Slow comparison function (such as floating-point numbers on x86 processors or strings). 90 91# Large data elements (such as key + data sorted on a key). 92 93# Completely sorted data when __spreadsort has an optimization to quit early in this case. 94 95Situations where __spreadsort is slower than __std_sort: 96 97# Data sorted in reverse order. Both __std_sort and __spreadsort are faster 98on reverse-ordered data than randomized data, 99but __std_sort speeds up more in this special case. 100 101# Very small amounts of data (< 1000 elements). 102For this reason there is a fallback in __spreadsort to __std_sort 103if the input size is less than 1000, 104so performance is identical for small amounts of data in practice. 105 106These functions are defined in `namespace boost::sort::spreadsort`. 107 108[endsect] [/section Introduction] 109 110[section:overloading Overloading] 111 112[tip In the Boost.Sort C++ Reference section, click on the appropriate overload, for example `float_sort(RandomAccessIter, RandomAccessIter, Right_shift, Compare);` to get full details of that overload.] 113 114Each of __integer_sort, __float_sort, and __string_sort have 3 main versions: 115The base version, which takes a first iterator and a last iterator, just like __std_sort: 116 117 integer_sort(array.begin(), array.end()); 118 float_sort(array.begin(), array.end()); 119 string_sort(array.begin(), array.end()); 120 121The version with an overridden shift functor, providing flexibility 122in case the `operator>>` already does something other than a bitshift. 123The rightshift functor takes two args, first the data type, 124and second a natural number of bits to shift right. 125 126For __string_sort this variant is slightly different; 127it needs a bracket functor equivalent to `operator`\[\], 128taking a number corresponding to the character offset, 129along with a second `getlength` functor to get the length of the string in characters. 130In all cases, this operator must return an integer type that compares with the 131`operator<` to provide the intended order 132(integers can be negated to reverse their order). 133 134In other words (aside from negative floats, which are inverted as ints): 135 136 rightshift(A, n) < rightshift(B, n) -> A < B 137 A < B -> rightshift(A, 0) < rightshift(B, 0) 138 139[rightshift_1] 140[bracket_1] 141 142See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of integer sorting with a rightshift functor. 143 144And a version with a comparison functor for maximum flexibility. 145This functor must provide the same sorting order as the integers returned by the rightshift (aside from negative floats): 146 147 rightshift(A, n) < rightshift(B, n) -> compare(A, B) 148 compare(A, B) -> rightshift(A, 0) < rightshift(B, 0) 149 150[reverse_int_2] 151 152Examples of functors are: 153 154[lessthan_functor] 155 156[bracket_functor] 157 158[getsize_functor] 159 160and these functors are used thus: 161 162[stringsort_functors_call] 163 164See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for a working example of sorting strings with all functors. 165 166[endsect] [/section:overloading Overloading] 167 168[section:performance Performance] 169 170The __spreadsort algorithm is a hybrid algorithm; 171when the number of elements being sorted is below a certain number, 172comparison-based sorting is used. Above it, radix sorting is used. 173The radix-based algorithm will thus cut up the problem into small pieces, 174and either completely sort the data based upon its radix if the data is clustered, 175or finish sorting the cut-down pieces with comparison-based sorting. 176 177The Spreadsort algorithm dynamically chooses 178either comparison-based or radix-based sorting when recursing, 179whichever provides better worst-case performance. 180This way worst-case performance is guaranteed to be the better of 181['[bigo](N[sdot]log2(N))] comparisons and ['[bigo](N[sdot]log2(K/S + S))] operations where 182 183* ['N] is the number of elements being sorted, 184* ['K] is the length in bits of the key, and 185* ['S] is a constant. 186 187This results in substantially improved performance for large [' N]; 188__integer_sort tends to be 50% to 2X faster than __std_sort, 189while __float_sort and _string_sort are roughly 2X faster than __std_sort. 190 191Performance graphs are provided for __integer_sort, __float_sort, and __string_sort in their description. 192 193Runtime Performance comparisons and graphs were made on a Core 2 Duo laptop 194running Windows Vista 64 with MSVC 8.0, 195and an old G4 laptop running Mac OSX with gcc. 196[@http://www.boost.org/build/doc/html/ Boost bjam/b2] was used to control compilation. 197 198Direct performance comparisons on a newer x86 system running Ubuntu, 199with the fallback to __std_sort at lower input sizes disabled are below. 200 201[note The fallback to __std_sort for smaller input sizes prevents 202the worse performance seen on the left sides of the first two graphs.] 203 204__integer_sort starts to become faster than __std_sort at about 1000 integers (4000 bytes), 205and __string_sort becomes faster than __std_sort at slightly fewer bytes (as few as 30 strings). 206 207[note The 4-threaded graph has 4 threads doing [*separate sorts simultaneously] 208(not splitting up a single sort) 209as a test for thread cache collision and other multi-threaded performance issues.] 210 211__float_sort times are very similar to __integer_sort times. 212 213[/ These provide links to the images, but currently graphs are shown - see below] 214[/@../../doc/images/single_threaded.png single_threaded.png] [/file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png] 215[/@../../doc/images/4_threaded.png 4_threaded.png] 216[/@../../doc/images/entropy.png entropy.png] 217[/@../../doc/images/bits_per_byte.png bits_per_byte.png] 218 219[$../images/single_threaded.png] [/<img src="../images/single_threaded.png"> == file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png] 220[$../images/4_threaded.png] 221[$../images/entropy.png] 222[$../images/bits_per_byte.png] 223 224Histogramming with a fixed maximum number of splits is used 225because it reduces the number of cache misses, 226thus improving performance relative to the approach described in detail 227in the [@http://en.wikipedia.org/wiki/Spreadsort original SpreadSort publication]. 228 229The importance of cache-friendly histogramming is described 230in __adaptive_left_reflex, 231though without the worst-case handling described below. 232 233The time taken per radix iteration is: 234 235['[bigo](N)] iterations over the data 236 237['[bigo](N)] integer-type comparisons (even for _float_sort and __string_sort) 238 239['[bigo](N)] swaps 240 241['[bigo](2[super S])] bin operations. 242 243To obtain ['[bigo](N)] worst-case performance per iteration, 244the restriction ['S <= log2(N)] is applied, and ['[bigo](2[super S])] becomes ['[bigo](N)]. 245For each such iteration, the number of unsorted bits log2(range) 246(referred to as ['K]) per element is reduced by ['S]. 247As ['S] decreases depending upon the amount of elements being sorted, 248it can drop from a maximum of ['S[sub max]] to the minimum of ['S[sub min]]. 249 250Assumption: __std_sort is assumed to be ['[bigo](N*log2(N))], 251as __introsort exists and is commonly used. 252(If you have a quibble with this please take it up with the implementor of your __std_sort; 253you're welcome to replace the recursive calls to __std_sort to calls 254to __introsort if your __std_sort library call is poorly implemented). 255 256[@http://en.wikipedia.org/wiki/Introsort Introsort] is not included with this algorithm for simplicity and 257because the implementor of the __std_sort call 258is assumed to know what they're doing. 259 260To maintain a minimum value for ['S (S[sub min])], 261comparison-based sorting has to be used to sort when 262['n <= log2(meanbinsize)], where ['log2(meanbinsize) (lbs)] is a small constant, 263usually between 0 and 4, used to minimize bin overhead per element. 264There is a small corner-case where if ['K < S[sub min]] and ['n >= 2^K], 265then the data can be sorted in a single radix-based iteration with an ['S = K] 266(this bucketsorting special case is by default only applied to __float_sort). 267So for the final recursion, worst-case performance is: 268 2691 radix-based iteration if ['K <= S[sub min]], 270 271or ['S[sub min] + lbs] comparison-based iterations if ['K > S[sub min]] but ['n <= 2[super (S[sub min] + lbs)]]. 272 273So for the final iteration, worst-case runtime is ['[bigo](N*(S[sub min] + lbs))] but 274if ['K > S[sub min]] and ['N > 2[super (S[sub min] + lbs)]] 275then more than 1 radix recursion will be required. 276 277For the second to last iteration, ['K <= S[sub min] * 2 + 1] can be handled, 278(if the data is divided into ['2[super (S[sub min] + 1)]] pieces) 279or if ['N < 2[super (S[sub min] + lbs + 1)]], 280then it is faster to fallback to __std_sort. 281 282In the case of a radix-based sort plus recursion, it will take 283['[bigo](N*(S[sub min] + lbs)) + [bigo](N) = [bigo](N*(S[sub min] + lbs + 1))] worst-case time, 284as 285['K_remaining = K_start - (S[sub min] + 1)], and ['K_start <= S[sub min] * 2 + 1]. 286 287Alternatively, comparison-based sorting is used if ['N < 2[super (S[sub min] + lbs + 1)]], 288which will take ['[bigo](N*(S[sub min] + lbs + 1))] time. 289 290So either way ['[bigo](N*(S[sub min] + lbs + 1))] is the worst-case time for the second to last iteration, 291which occurs if ['K <= S[sub min] * 2 + ]1 or ['N < 2[super (S[sub min] + lbs + 1)]]. 292 293This continues as long as ['S[sub min] <= S <= S[sub max]], 294so that for ['K_m <= K_(m-1) + S[sub min] + m] where ['m] 295is the maximum number of iterations after this one has finished, 296or where ['N < 2[super (S[sub min] + lbs + m)]], 297then the worst-case runtime is ['[bigo](N*(S[sub min] + lbs + m))]. 298 299[space][space]['K_m] at ['m <= (S[sub max] - S[sub min])] works out to: 300 301[space][space]['K_1 <= (S[sub min]) + S[sub min] + 1 <= 2S[sub min] + 1] 302 303[space][space]['K_2 <= (2S[sub min] + 1) + S[sub min] + 2] 304 305as the sum from 0 to ['m] is ['m(m + 1)/2] 306 307[space][space]['K_m <= (m + 1)S[sub min] + m(m + 1)/2 <= (S[sub min] + m/2)(m + 1)] 308 309substituting in S[sub max] - S[sub min] for m 310 311[space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + (S[sub max] - S[sub min])/2)*(S[sub max] - S[sub min] + 1)] 312 313[space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + S[sub max]) * (S[sub max] - S[sub min] + 1)/2] 314 315Since this involves ['S[sub max] - S[sub min] + 1] iterations, 316this works out to dividing ['K] into an average ['(S[sub min] + S[sub max])]/2 317pieces per iteration. 318 319To finish the problem from this point takes ['[bigo](N * (S[sub max] - S[sub min]))] for ['m] iterations, 320plus the worst-case of ['[bigo](N*(S[sub min] + lbs))] for the last iteration, 321for a total of ['[bigo](N *(S[sub max] + lbs))] time. 322 323When ['m > S[sub max] - S[sub min]], the problem is divided into ['S[sub max]] pieces per iteration, 324or __std_sort is called if ['N < 2^(m + S[sub min] + lbs)]. For this range: 325 326[space][space]['K_m <= K_(m - 1) + S[sub max]], providing runtime of 327 328[space][space]['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))] if recursive, 329 330or ['[bigo](N * log(2^(m + S[sub min] + lbs)))] if comparison-based, 331 332which simplifies to ['[bigo](N * (m + S[sub min] + lbs))], 333which substitutes to ['[bigo](N * ((m - (S[sub max] - S[sub min])) + S[sub max] + lbs))], 334which given that ['m - (S[sub max] - S[sub min]) <= (K - K_(S[sub max] - S[sub min]))/S[sub max]] 335(otherwise a lesser number of radix-based iterations would be used) 336 337also comes out to ['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))]. 338 339Asymptotically, for large ['N] and large ['K], this simplifies to: 340 341[space][space]['[bigo](N * (K/S[sub max] + S[sub max] + lbs))], 342 343simplifying out the constants related to the ['S[sub max] - S[sub min]] range, 344providing an additional ['[bigo](N * (S[sub max] + lbs))] runtime on top of the 345['[bigo](N * (K/S))] performance of LSD __radix_sort, 346but without the ['[bigo](N)] memory overhead. 347For simplicity, because ['lbs] is a small constant 348(0 can be used, and performs reasonably), 349it is ignored when summarizing the performance in further discussions. 350By checking whether comparison-based sorting is better, 351Spreadsort is also ['[bigo](N*log(N))], whichever is better, 352and unlike LSD __radix_sort, can perform much better than the worst-case 353if the data is either evenly distributed or highly clustered. 354 355This analysis was for __integer_sort and __float_sort. 356__string_sort differs in that ['S[sub min] = S[sub max] = sizeof(Char_type) * 8], 357['lbs] is 0, and that __std_sort's comparison is not a constant-time operation, 358so strictly speaking __string_sort runtime is 359 360[space][space]['[bigo](N * (K/S[sub max] + (S[sub max] comparisons)))]. 361 362Worst-case, this ends up being ['[bigo](N * K)] 363(where ['K] is the mean string length in bytes), 364as described for __american_flag, which is better than the 365 366[space][space]['[bigo](N * K * log(N))] 367 368worst-case for comparison-based sorting. 369 370[endsect] [/section:performance Performance] 371 372[section:tuning Tuning] 373__integer_sort and __float_sort have tuning constants that control 374how the radix-sorting portion of those algorithms work. 375The ideal constant values for __integer_sort and __float_sort vary depending on 376the platform, compiler, and data being sorted. 377By far the most important constant is ['max_splits], 378which defines how many pieces the radix-sorting portion splits 379the data into per iteration. 380 381The ideal value of ['max_splits] depends upon the size of the L1 processor cache, 382and is between 10 and 13 on many systems. 383A default value of 11 is used. For mostly-sorted data, a much larger value is better, 384as swaps (and thus cache misses) are rare, 385but this hurts runtime severely for unsorted data, so is not recommended. 386 387On some x86 systems, when the total number of elements being sorted is small 388( less than 1 million or so), the ideal ['max_splits] can be substantially larger, 389such as 17. This is suspected to be because all the data fits into the L2 cache, 390and misses from L1 cache to L2 cache do not impact performance 391as severely as misses to main memory. 392Modifying tuning constants other than ['max_splits] is not recommended, 393as the performance improvement for changing other constants is usually minor. 394 395If you can afford to let it run for a day, and have at least 1GB of free memory, 396the perl command: `./tune.pl -large -tune` (UNIX) 397or `perl tune.pl -large -tune -windows` (Windows) 398can be used to automatically tune these constants. 399This should be run from the `libs/sort directory` inside the boost home directory. 400This will work to identify the `ideal constants.hpp` settings for your system, 401testing on various distributions in a 20 million element (80MB) file, 402and additionally verifies that all sorting routines sort correctly 403across various data distributions. 404Alternatively, you can test with the file size you're most concerned with 405`./tune.pl number -tune` (UNIX) or `perl tune.pl number -tune -windows` (Windows). 406Substitute the number of elements you want to test with for `number`. 407Otherwise, just use the options it comes with, they're decent. 408With default settings `./tune.pl -tune` (UNIX) `perl tune.pl -tune -windows` (Windows), 409the script will take hours to run (less than a day), 410but may not pick the correct ['max_splits] if it is over 10. 411Alternatively, you can add the `-small` option to make it take just a few minutes, 412tuning for smaller vector sizes (one hundred thousand elements), 413but the resulting constants may not be good for large files 414(see above note about ['max_splits] on Windows). 415 416The tuning script can also be used just to verify that sorting works correctly 417on your system, and see how much of a speedup it gets, 418by omiting the "-tune" option. This runs at the end of tuning runs. 419Default args will take about an hour to run and give accurate results 420on decent-sized test vectors. `./tune.pl -small` (UNIX) `perl tune.pl -small -windows` (Windows) 421is a faster option, that tests on smaller vectors and isn't as accurate. 422 423If any differences are encountered during tuning, please call `tune.pl` with `-debug > log_file_name`. 424If the resulting log file contains compilation or permissions issues, 425it is likely an issue with your setup. 426If some other type of error is encountered (or result differences), 427please send them to the library author at spreadsort@gmail.com. 428Including the zipped `input.txt` that was being used is also helpful. 429 430[endsect] [/section:tuning Tuning] 431 432[endsect] [/section Overview] 433 434[section:sort_hpp Spreadsort] 435 436[section:header_spreadsort Header `<boost/sort/spreadsort/spreadsort.hpp>`] 437 438__spreadsort checks whether the data-type provided is an integer, 439castable float, string, or wstring. 440 441* If data-type is an integer, __integer_sort is used. 442* If data-type is a float, __float_sort is used. 443* If data-type is a string or wstring, __string_sort is used. 444* Sorting other data-types requires picking between 445__integer_sort, __float_sort and __string_sort directly, 446as __spreadsort won't accept types that don't have the appropriate type traits. 447 448Overloading variants are provided that permit use of user-defined right-shift functors and comparison functors. 449 450Each function is optimized for its set of arguments; default functors are not provided to avoid the risk of any reduction of performance. 451 452See __overloading section. 453 454[h5 Rationale:] 455 456__spreadsort function provides a wrapper that calls the fastest sorting algorithm 457available for a data-type, enabling faster generic programming. 458 459[section:spreadsort_examples Spreadsort Examples] 460 461See [@../../example/ example] folder for all examples. 462 463See [@../../example/sample.cpp sample.cpp] for a simple working example. 464 465For an example of 64-bit integer sorting, see [@../../example/int64.cpp int64.cpp]. 466 467This example sets the element type of a vector to 64-bit integer 468 469[int64bit_1] 470 471and calls the sort 472 473[int64bit_2] 474 475For a simple example sorting `float`s, 476 477 vector<float> vec; 478 vec.push_back(1.0); 479 vec.push_back(2.3); 480 vec.push_back(1.3); 481 ... 482 spreadsort(vec.begin(), vec.end()); 483 //The sorted vector contains "1.0 1.3 2.3 ..." 484 485See also [@../../example/floatsample.cpp floatsample.cpp] which checks for abnormal values. 486 487[endsect] [/section:spreadsort_examples Spreadsort Examples] 488 489[endsect] [/section:header_spreadsort Header `<boost/sort/spreadsort/spreadsort.hpp>`] 490 491[section:integer_sort Integer Sort] 492 493__integer_sort is a fast templated in-place hybrid radix/comparison algorithm, 494which in testing tends to be roughly 50% to 2X faster than 495__std_sort for large tests (>=100kB). 496Worst-case performance is ['[bigo](N * (log2(range)/s + s))], 497so __integer_sort is asymptotically faster than pure comparison-based algorithms. 498['s] is ['max_splits], which defaults to 11, 499so its worst-case with default settings for 32-bit integers is ['[bigo](N * ((32/11)] 500slow radix-based iterations + 11 fast comparison-based iterations). 501 502Some performance plots of runtime vs. n and log2(range) are provided below: 503 504[@../../doc/graph/windows_integer_sort.htm Windows Integer Sort] 505 506[@../../doc/graph/osx_integer_sort.htm OSX integer Sort] 507 508[section:integersort_examples Integer Sort Examples] 509 510See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of using rightshift, using a user-defined functor: 511 512[rightshift_int_functor] 513 514Other examples: 515 516[@../../example/keyplusdatasample.cpp Sort structs using an integer key.] 517 518[@../../example/reverseintsample.cpp Sort integers in reverse order.] 519 520[@../../example/mostlysorted.cpp Simple sorting of integers; this case is a performance test for integers that are already mostly sorted.] 521 522[endsect] [/section:integersort_examples Integer Sort Examples] 523 524[endsect] [/section:integer_sort Integer Sort] 525 526[section:float_sort Float Sort] 527 528__float_sort is a fast templated in-place hybrid radix/comparison algorithm much like __integer_sort, but sorts IEEE floating-point numbers (positive, zero, NaN, and negative) into ascending order by casting them to integers. This works because positive IEEE floating-point numbers sort like integers with the same bits, and negative IEEE floating-point numbers sort in the reverse order of integers with the same bits. float_sort is roughly 2X as fast as std::sort. 529 530-0.0 vs. 0.0 and NaN are given definitive ordered positions by the radix-based portion of this algorithm, where comparison-based sorting does not guarantee their relative position. The included tests avoid creating NaN and -0.0 so that results match std::sort, which is not consistent in how it handles these numbers, as they compare as equal to numbers with different values. 531 532float_sort checks the size of the data type and whether it is castable, picking 533 an integer type to cast to, if a casting functor isn't provided by the user. 534 535float_mem_cast casts IEEE floating-point numbers (positive, zero, NaN, and negative) into integers. This is an essential utility for creating a custom rightshift functor for float_sort, when one is needed. Only IEEE floating-point numbers of the same size as the integer type being cast to should be used in this specialized method call. 536Worst-case performance is ['[bigo](N * (log2(range)/s + s))], 537so __float_sort is asymptotically faster than pure comparison-based algorithms. 538['s] is ['max_splits], which defaults to 11, 539so its worst-case with default settings for 32-bit integers is ['[bigo](N * ((32/11)] 540slow radix-based iterations + 11 fast comparison-based iterations). 541 542Some performance plots of runtime vs. n and log2(range) are provided below: 543 544[@../../doc/graph/windows_float_sort.htm Windows Float Sort] 545 546[@../../doc/graph/osx_float_sort.htm OSX Float Sort] 547 548[section:floatsort_examples Float Sort Examples] 549 550See [@../../example/floatfunctorsample.cpp floatfunctorsample.cpp] for a working example of how to sort structs with a float key: 551 552[float_functor_types] 553 554[float_functor_datatypes] 555 556Right-shift functor: 557 558[float_functor_rightshift] 559 560Comparison lessthan `operator<` functor: 561 562[float_functor_lessthan] 563 564Other examples: 565 566[@../../example/double.cpp Sort doubles.] 567 568[@../../example/shiftfloatsample.cpp Sort floats using a rightshift functor.] 569 570[endsect] [/section:floatsort_examples Float Sort Examples] 571 572[endsect] [/section:float_sort Float Sort] 573 574[section:string_sort String Sort] 575__string_sort is a hybrid radix-based/comparison-based algorithm that sorts strings of characters (or arrays of binary data) in ascending order. 576 577The simplest version (no functors) sorts strings of items that can cast to an unsigned data type (such as an unsigned char), have a < operator, have a size function, and have a data() function that returns a pointer to an array of characters, such as a std::string. The functor version can sort any data type that has a strict weak ordering, via templating, but requires definitions of a get_char (acts like x[offset] on a string or a byte array), get_length (returns length of the string being sorted), and a comparison functor. Individual characters returned by get_char must support the != operator and have an unsigned value that defines their lexicographical order. 578 579This algorithm is not efficient for character types larger than 2 bytes each, and is optimized for one-byte character strings. For this reason, __std_sort will be called instead if the character type is of size > 2. 580 581__string_sort has a special optimization for identical substrings. This adds some overhead on random data, but identical substrings are common in real strings. 582 583reverse_string_sort sorts strings in reverse (decending) order, but is otherwise identical. __string_sort is sufficiently flexible that it should sort any data type that __std_sort can, assuming the user provides appropriate functors that index into a key. 584 585[@../../doc/graph/windows_string_sort.htm Windows String Sort] 586 587[@../../doc/graph/osx_string_sort.htm OSX String Sort] 588 589 590 591[section:stringsort_examples String Sort Examples] 592 593See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for an example of how to sort structs using a string key and all functors: 594 595[lessthan_functor] 596 597[bracket_functor] 598 599[getsize_functor] 600 601and these functors are used thus: 602 603[stringsort_functors_call] 604 605 606See [@../../example/generalizedstruct.cpp generalizedstruct.cpp] for a working example of a generalized approach to sort structs by a sequence of integer, float, and multiple string keys using string_sort: 607 608[generalized_functors] 609 610[generalized_functors_call] 611 612Other examples: 613 614[@../../example/stringsample.cpp String sort.] 615 616[@../../example/reversestringsample.cpp Reverse string sort.] 617 618[@../../example/wstringsample.cpp Wide character string sort.] 619 620[@../../example/caseinsensitive.cpp Case insensitive string sort.] 621 622[@../../example/charstringsample.cpp Sort structs using a string key and indexing functors.] 623 624[@../../example/reversestringfunctorsample.cpp Sort structs using a string keynd all functors in reverse order.] 625 626[endsect] [/section:stringsort_examples String Sort Examples] 627 628[endsect] [/section:string_sort String Sort] 629 630[section:rationale Rationale] 631 632[section:radix_sorting Radix Sorting] 633Radix-based sorting allows the data to be divided up into more than 2 pieces per iteration, 634and for cache-friendly versions, it normally cuts the data up into around a thousand pieces per iteration. 635This allows many fewer iterations to be used to complete sorting the data, 636enabling performance superior to the ['[bigo](N*log(N))] comparison-based sorting limit. 637[endsect] [/section:radix_sorting Radix Sorting] 638 639[section:hybrid_radix Hybrid Radix] 640 641There a two primary types of radix-based sorting: 642 643Most-significant-digit Radix sorting (MSD) divides the data recursively 644based upon the top-most unsorted bits. 645This approach is efficient for even distributions that divide nicely, 646and can be done in-place (limited additional memory used). 647There is substantial constant overhead for each iteration to deal 648with the splitting structure. 649The algorithms provided here use MSD Radix Sort for their radix-sorting portion. 650The main disadvantage of MSD Radix sorting is that when the data is cut up into small 651pieces, the overhead for additional recursive calls starts to dominate runtime, 652and this makes worst-case behavior substantially worse than ['[bigo](N*log(N))]. 653 654By contrast, __integer_sort, __float_sort, and __string_sort all check to see 655whether Radix-based or Comparison-based sorting have better worst-case runtime, 656and make the appropriate recursive call. 657Because Comparison-based sorting algorithms are efficient on small pieces, 658the tendency of MSD __radix_sort to cut the problem up small is turned into 659an advantage by these hybrid sorts. It is hard to conceive of a common usage case 660where pure MSD __radix_sort would have any significant advantage 661over hybrid algorithms. 662 663Least-significant-digit __radix_sort (LSD) sorts based upon 664the least-significant bits first. This requires a complete copy of the data being sorted, 665using substantial additional memory. The main advantage of LSD Radix Sort 666is that aside from some constant overhead and the memory allocation, 667it uses a fixed amount of time per element to sort, regardless of distribution or 668size of the list. This amount of time is proportional to the length of the radix. 669The other advantage of LSD Radix Sort is that it is a stable sorting algorithm, 670so elements with the same key will retain their original order. 671 672One disadvantage is that LSD Radix Sort uses the same amount of time 673to sort "easy" sorting problems as "hard" sorting problems, 674and this time spent may end up being greater than an efficient ['[bigo](N*log(N))] 675algorithm such as __introsort spends sorting "hard" problems on large data sets, 676depending on the length of the datatype, and relative speed of comparisons, 677memory allocation, and random accesses. 678 679The other main disadvantage of LSD Radix Sort is its memory overhead. 680It's only faster for large data sets, but large data sets use significant memory, 681which LSD Radix Sort needs to duplicate. LSD Radix Sort doesn't make sense for items 682of variable length, such as strings; it could be implemented by starting at the end 683of the longest element, but would be extremely inefficient. 684 685All that said, there are places where LSD Radix Sort is the appropriate and 686fastest solution, so it would be appropriate to create a templated LSD Radix Sort 687similar to __integer_sort and __float_sort. This would be most appropriate in cases where 688comparisons are extremely slow. 689 690[endsect] [/section:hybrid_radix Hybrid Radix] 691 692[section:why_spreadsort Why spreadsort?] 693 694The __spreadsort algorithm used in this library is designed to provide best possible 695worst-case performance, while still being cache-friendly. 696It provides the better of ['[bigo](N*log(K/S + S))] and ['[bigo](N*log(N))] worst-case time, 697where ['K] is the log of the range. The log of the range is normally the length in bits 698of the data type; 32 for a 32-bit integer. 699 700`flash_sort` (another hybrid algorithm), by comparison is ['[bigo](N)] 701for evenly distributed lists. The problem is, `flash_sort` is merely an MSD __radix_sort 702combined with ['[bigo](N*N)] insertion sort to deal with small subsets where 703the MSD Radix Sort is inefficient, so it is inefficient with chunks of data 704around the size at which it switches to `insertion_sort`, and ends up operating 705as an enhanced MSD Radix Sort. 706For uneven distributions this makes it especially inefficient. 707 708__integer_sort and __float_sort use __introsort instead, which provides ['[bigo](N*log(N))] 709performance for these medium-sized pieces. Also, `flash_sort`'s ['[bigo](N)] 710performance for even distributions comes at the cost of cache misses, 711which on modern architectures are extremely expensive, and in testing 712on modern systems ends up being slower than cutting up the data in multiple, 713cache-friendly steps. Also worth noting is that on most modern computers, 714`log2(available RAM)/log2(L1 cache size)` is around 3, 715where a cache miss takes more than 3 times as long as an in-cache random-access, 716and the size of ['max_splits] is tuned to the size of the cache. 717On a computer where cache misses aren't this expensive, ['max_splits] 718could be increased to a large value, or eliminated entirely, 719and `integer_sort/float_sort` would have the same ['[bigo](N)] performance 720on even distributions. 721 722Adaptive Left Radix (ALR) is similar to `flash_sort`, but more cache-friendly. 723It still uses insertion_sort. Because ALR uses ['[bigo](N*N)] `insertion_sort`, 724it isn't efficient to use the comparison-based fallback sort on large lists, 725and if the data is clustered in small chunks just over the fallback size 726with a few outliers, radix-based sorting iterates many times doing little sorting 727with high overhead. Asymptotically, ALR is still ['[bigo](N*log(K/S + S))], 728but with a very small ['S] (about 2 in the worst case), 729which compares unfavorably with the 11 default value of ['max_splits] for 730Spreadsort. 731 732ALR also does not have the ['[bigo](N*log(N))] fallback, so for small lists 733that are not evenly distributed it is extremely inefficient. 734See the `alrbreaker` and `binaryalrbreaker` testcases for examples; 735either replace the call to sort with a call to ALR and update the ALR_THRESHOLD 736at the top, or as a quick comparison make `get_max_count return ALR_THRESHOLD` 737(20 by default based upon the paper). 738These small tests take 4-10 times as long with ALR as __std_sort 739in the author's testing, depending on the test system, 740because they are trying to sort a highly uneven distribution. 741Normal Spreadsort does much better with them, because `get_max_count` 742is designed around minimizing worst-case runtime. 743 744`burst_sort` is an efficient hybrid algorithm for strings that 745uses substantial additional memory. 746 747__string_sort uses minimal additional memory by comparison. 748Speed comparisons between the two haven't been made, 749but the better memory efficiency makes __string_sort more general. 750 751`postal_sort` and __string_sort are similar. A direct performance comparison 752would be welcome, but an efficient version of `postal_sort` was not found 753in a search for source. 754 755__string_sort is most similar to the __american_flag algorithm. 756The main difference is that it doesn't bother trying to optimize how empty 757buckets/piles are handled, instead just checking to see if all characters 758at the current index are equal. Other differences are using __std_sort 759as the fallback algorithm, and a larger fallback size (256 vs. 16), 760which makes empty pile handling less important. 761 762Another difference is not applying the stack-size restriction. 763Because of the equality check in __string_sort, it would take ['m*m] memory 764worth of strings to force __string_sort to create a stack of depth ['m]. 765This problem isn't a realistic one on modern systems with multi-megabyte stacksize 766limits, where main memory would be exhausted holding the long strings necessary 767to exceed the stacksize limit. __string_sort can be thought of as modernizing 768__american_flag to take advantage of __introsort as a fallback algorithm. 769In the author's testing, __american_flag (on `std::strings`) had comparable runtime 770to __introsort, but making a hybrid of the two allows reduced overhead and 771substantially superior performance. 772 773[endsect] [/section:why_spreadsort] 774 775[section:unstable_sort Unstable Sorting] 776 777Making a __radix_sort stable requires the usage of an external copy of the data. 778A stable hybrid algorithm also requires a stable comparison-based algorithm, 779and these are generally slow. LSD __radix_sort uses an external copy of the data, 780and provides stability, along with likely being faster (than a stable hybrid sort), 781so that's probably a better way to go for integer and floating-point types. 782It might make sense to make a stable version of __string_sort using external memory, 783but for simplicity this has been left out for now. 784 785[endsect] [/section:unstable_sort Unstable Sorting] 786 787[section:optimization Unused X86 optimization] 788 789Though the ideal ['max_splits] for `n < 1 million` (or so) on x86 790['seems] to be substantially larger, enabling a roughly 15% speedup for such tests, 791this optimization isn't general, and doesn't apply for `n > 1 million`. 792A too large ['max_splits] can cause sort to take more than twice as long, 793so it should be set on the low end of the reasonable range, where it is right now. 794 795[endsect] [/section:optimization Unused X86 optimization] 796 797[section:lookup Lookup Table?] 798 799The ideal way to optimize the constants would be to have a carefully-tuned 800lookup-table instead of the `get_max_count` function, but 4 tuning variables 801is simpler, `get_max_count` enforces worst-case performance minimization rules, 802and such a lookup table would be difficult to optimize 803for cross-platform performance. 804 805Alternatively, `get_max_count` could be used to generate a static lookup table. 806This hasn't been done due to concerns about cross-platform compatibility 807and flexibility. 808 809[endsect] [/section:lookup] 810 811[endsect] [/section:rationale Rationale] 812 813[endsect] [/section:sort_hpp Spreadsort] 814 815[section:definitions Definitions] 816 817[h4 stable sort] 818 819A sorting approach that preserves pre-existing order. 820If there are two elements with identical keys in a list that is later stably sorted, 821whichever came first in the initial list will come first in a stably sorted list. 822The algorithms provided here provide no such guarantee; items with identical keys 823will have arbitrary resulting order relative to each other. 824 825[endsect] [/section:definitions Definitions] 826 827[section:faq Frequently Asked Questions] 828 829There are no FAQs yet. 830 831[endsect] [/section:faq Frequently asked Questions] 832 833[section:acks Acknowledgements] 834 835* The author would like to thank his wife Mary for her patience and support 836during the long process of converting this from a piece of C code 837to a template library. 838 839* The author would also like to thank Phil Endecott and Frank Gennari 840for the improvements they've suggested and for testing. 841Without them this would have taken longer to develop or performed worse. 842 843* `float_mem_cast` was fixed to be safe and fast thanks to Scott McMurray. 844That fix was critical for a high-performance cross-platform __float_sort. 845 846* Thanks also for multiple helpful suggestions provided by Steven Watanabe, 847Edouard Alligand, and others. 848 849* Initial documentation was refactored to use Quickbook by Paul A. Bristow. 850 851[endsect] [/section:acknowledgements Acknowledgements] 852 853[section:bibliog Bibliography] 854 855[h4 Standard Template Library Sort Algorithms] 856 857[@http://www.cplusplus.com/reference/algorithm/sort/ C++ STL sort algorithms]. 858 859[h4 Radix Sort] 860 861A type of algorithm that sorts based upon distribution instead of by comparison. 862Wikipedia has an article about Radix Sorting. 863A more detailed description of various Radix Sorting algorithms is provided here: 864 865Donald Knuth. The Art of Computer Programming, 866Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998. 867ISBN 0-201-89685-0. Section 5.2.5: Sorting by Distribution, pp.168-179. 868 869[h4 Introsort] 870 871A high-speed comparison-based sorting algorithm that takes ['[bigo](N * log(N))] time. 872See __introsort and 873Musser, David R. (1997). "Introspective Sorting and Selection Algorithms", 874Software: Practice and Experience (Wiley) 27 (8), pp 983-993, 875available at [@http://www.cs.rpi.edu/~musser/gp/introsort.ps Musser Introsort]. 876 877[h4 American Flag Sort] 878 879A high-speed hybrid string sorting algorithm that __string_sort is partially based 880upon. See __american_flag and Peter M. McIlroy, Keith Bostic, M. Douglas McIlroy. Engineering Radix Sort, Computing Systems 1993. 881 882[h4 Adaptive Left Radix (ARL)] 883 884ARL (Adaptive Left Radix) is a hybrid cache-friendly integer sorting algorithm 885with comparable speed on random data to __integer_sort, 886but does not have the optimizations for worst-case performance, 887causing it to perform poorly on certain types of unevenly distributed data. 888 889Arne Maus, [@http://www.nik.no/2002/Maus.pdf ARL, a faster in-place, cache friendly sorting algorithm], 890presented at NIK2002, Norwegian Informatics Conference, Kongsberg, 2002. Tapir, ISBN 82-91116-45-8. 891 892[h4 Original Spreadsort] 893 894The algorithm that __integer_sort was originally based on. 895__integer_sort uses a smaller number of key bits at a time for better cache efficiency 896than the method described in the paper. 897The importance of cache efficiency grew as CPU clock speeds increased 898while main memory latency stagnated. 899See Steven J. Ross, 900The Spreadsort High-performance General-case Sorting Algorithm, 901Parallel and Distributed Processing Techniques and Applications, Volume 3, pp.1100-1106. Las Vegas Nevada. 2002. See 902[@../../doc/papers/original_spreadsort06_2002.pdf Steven Ross spreadsort_2002]. 903 904[endsect] [/section:bibliography Bibliography] 905 906[section:history History] 907 908* First release following review in Boost 1.58. 909 910* [@http://permalink.gmane.org/gmane.comp.lib.boost.devel/255194 Review of Boost.Sort/Spreadsort library] 911 912[endsect] [/section:history] 913 914[xinclude autodoc.xml] [/ Using Doxygen reference documentation.] 915 916[endsect] [/section:spreadsort]