1Intro 2----- 3This describes an adaptive, stable, natural mergesort, modestly called 4timsort (hey, I earned it <wink>). It has supernatural performance on many 5kinds of partially ordered arrays (less than lg(N!) comparisons needed, and 6as few as N-1), yet as fast as Python's previous highly tuned samplesort 7hybrid on random arrays. 8 9In a nutshell, the main routine marches over the array once, left to right, 10alternately identifying the next run, then merging it into the previous 11runs "intelligently". Everything else is complication for speed, and some 12hard-won measure of memory efficiency. 13 14 15Comparison with Python's Samplesort Hybrid 16------------------------------------------ 17+ timsort can require a temp array containing as many as N//2 pointers, 18 which means as many as 2*N extra bytes on 32-bit boxes. It can be 19 expected to require a temp array this large when sorting random data; on 20 data with significant structure, it may get away without using any extra 21 heap memory. This appears to be the strongest argument against it, but 22 compared to the size of an object, 2 temp bytes worst-case (also expected- 23 case for random data) doesn't scare me much. 24 25 It turns out that Perl is moving to a stable mergesort, and the code for 26 that appears always to require a temp array with room for at least N 27 pointers. (Note that I wouldn't want to do that even if space weren't an 28 issue; I believe its efforts at memory frugality also save timsort 29 significant pointer-copying costs, and allow it to have a smaller working 30 set.) 31 32+ Across about four hours of generating random arrays, and sorting them 33 under both methods, samplesort required about 1.5% more comparisons 34 (the program is at the end of this file). 35 36+ In real life, this may be faster or slower on random arrays than 37 samplesort was, depending on platform quirks. Since it does fewer 38 comparisons on average, it can be expected to do better the more 39 expensive a comparison function is. OTOH, it does more data movement 40 (pointer copying) than samplesort, and that may negate its small 41 comparison advantage (depending on platform quirks) unless comparison 42 is very expensive. 43 44+ On arrays with many kinds of pre-existing order, this blows samplesort out 45 of the water. It's significantly faster than samplesort even on some 46 cases samplesort was special-casing the snot out of. I believe that lists 47 very often do have exploitable partial order in real life, and this is the 48 strongest argument in favor of timsort (indeed, samplesort's special cases 49 for extreme partial order are appreciated by real users, and timsort goes 50 much deeper than those, in particular naturally covering every case where 51 someone has suggested "and it would be cool if list.sort() had a special 52 case for this too ... and for that ..."). 53 54+ Here are exact comparison counts across all the tests in sortperf.py, 55 when run with arguments "15 20 1". 56 57 Column Key: 58 *sort: random data 59 \sort: descending data 60 /sort: ascending data 61 3sort: ascending, then 3 random exchanges 62 +sort: ascending, then 10 random at the end 63 %sort: ascending, then randomly replace 1% of elements w/ random values 64 ~sort: many duplicates 65 =sort: all equal 66 !sort: worst case scenario 67 68 First the trivial cases, trivial for samplesort because it special-cased 69 them, and trivial for timsort because it naturally works on runs. Within 70 an "n" block, the first line gives the # of compares done by samplesort, 71 the second line by timsort, and the third line is the percentage by 72 which the samplesort count exceeds the timsort count: 73 74 n \sort /sort =sort 75------- ------ ------ ------ 76 32768 32768 32767 32767 samplesort 77 32767 32767 32767 timsort 78 0.00% 0.00% 0.00% (samplesort - timsort) / timsort 79 80 65536 65536 65535 65535 81 65535 65535 65535 82 0.00% 0.00% 0.00% 83 84 131072 131072 131071 131071 85 131071 131071 131071 86 0.00% 0.00% 0.00% 87 88 262144 262144 262143 262143 89 262143 262143 262143 90 0.00% 0.00% 0.00% 91 92 524288 524288 524287 524287 93 524287 524287 524287 94 0.00% 0.00% 0.00% 95 961048576 1048576 1048575 1048575 97 1048575 1048575 1048575 98 0.00% 0.00% 0.00% 99 100 The algorithms are effectively identical in these cases, except that 101 timsort does one less compare in \sort. 102 103 Now for the more interesting cases. Where lg(x) is the logarithm of x to 104 the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for 105 the best any comparison-based sorting algorithm can do on average (across 106 all permutations). When a method gets significantly below that, it's 107 either astronomically lucky, or is finding exploitable structure in the 108 data. 109 110 111 n lg(n!) *sort 3sort +sort %sort ~sort !sort 112------- ------- ------ ------- ------- ------ ------- -------- 113 32768 444255 453096 453614 32908 452871 130491 469141 old 114 448885 33016 33007 50426 182083 65534 new 115 0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new 116 117 65536 954037 972699 981940 65686 973104 260029 1004607 118 962991 65821 65808 101667 364341 131070 119 1.01% 1391.83% -0.19% 857.15% -28.63% 666.47% 120 121 131072 2039137 2101881 2091491 131232 2092894 554790 2161379 122 2057533 131410 131361 206193 728871 262142 123 2.16% 1491.58% -0.10% 915.02% -23.88% 724.51% 124 125 262144 4340409 4464460 4403233 262314 4445884 1107842 4584560 126 4377402 262437 262459 416347 1457945 524286 127 1.99% 1577.82% -0.06% 967.83% -24.01% 774.44% 128 129 524288 9205096 9453356 9408463 524468 9441930 2218577 9692015 130 9278734 524580 524633 837947 2916107 1048574 131 1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30% 132 1331048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212 134 19606028 1048958 1048941 1694896 5832445 2097150 135 1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38% 136 137 Discussion of cases: 138 139 *sort: There's no structure in random data to exploit, so the theoretical 140 limit is lg(n!). Both methods get close to that, and timsort is hugging 141 it (indeed, in a *marginal* sense, it's a spectacular improvement -- 142 there's only about 1% left before hitting the wall, and timsort knows 143 darned well it's doing compares that won't pay on random data -- but so 144 does the samplesort hybrid). For contrast, Hoare's original random-pivot 145 quicksort does about 39% more compares than the limit, and the median-of-3 146 variant about 19% more. 147 148 3sort, %sort, and !sort: No contest; there's structure in this data, but 149 not of the specific kinds samplesort special-cases. Note that structure 150 in !sort wasn't put there on purpose -- it was crafted as a worst case for 151 a previous quicksort implementation. That timsort nails it came as a 152 surprise to me (although it's obvious in retrospect). 153 154 +sort: samplesort special-cases this data, and does a few less compares 155 than timsort. However, timsort runs this case significantly faster on all 156 boxes we have timings for, because timsort is in the business of merging 157 runs efficiently, while samplesort does much more data movement in this 158 (for it) special case. 159 160 ~sort: samplesort's special cases for large masses of equal elements are 161 extremely effective on ~sort's specific data pattern, and timsort just 162 isn't going to get close to that, despite that it's clearly getting a 163 great deal of benefit out of the duplicates (the # of compares is much less 164 than lg(n!)). ~sort has a perfectly uniform distribution of just 4 165 distinct values, and as the distribution gets more skewed, samplesort's 166 equal-element gimmicks become less effective, while timsort's adaptive 167 strategies find more to exploit; in a database supplied by Kevin Altis, a 168 sort on its highly skewed "on which stock exchange does this company's 169 stock trade?" field ran over twice as fast under timsort. 170 171 However, despite that timsort does many more comparisons on ~sort, and 172 that on several platforms ~sort runs highly significantly slower under 173 timsort, on other platforms ~sort runs highly significantly faster under 174 timsort. No other kind of data has shown this wild x-platform behavior, 175 and we don't have an explanation for it. The only thing I can think of 176 that could transform what "should be" highly significant slowdowns into 177 highly significant speedups on some boxes are catastrophic cache effects 178 in samplesort. 179 180 But timsort "should be" slower than samplesort on ~sort, so it's hard 181 to count that it isn't on some boxes as a strike against it <wink>. 182 183+ Here's the highwater mark for the number of heap-based temp slots (4 184 bytes each on this box) needed by each test, again with arguments 185 "15 20 1": 186 187 2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort 188 32768 16384 0 0 6256 0 10821 12288 0 16383 189 65536 32766 0 0 21652 0 31276 24576 0 32767 190 131072 65534 0 0 17258 0 58112 49152 0 65535 191 262144 131072 0 0 35660 0 123561 98304 0 131071 192 524288 262142 0 0 31302 0 212057 196608 0 262143 1931048576 524286 0 0 312438 0 484942 393216 0 524287 194 195 Discussion: The tests that end up doing (close to) perfectly balanced 196 merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort 197 also ends up doing balanced merges, but systematically benefits a lot from 198 the preliminary pre-merge searches described under "Merge Memory" later. 199 %sort approaches having a balanced merge at the end because the random 200 selection of elements to replace is expected to produce an out-of-order 201 element near the midpoint. \sort, /sort, =sort are the trivial one-run 202 cases, needing no merging at all. +sort ends up having one very long run 203 and one very short, and so gets all the temp space it needs from the small 204 temparray member of the MergeState struct (note that the same would be 205 true if the new random elements were prefixed to the sorted list instead, 206 but not if they appeared "in the middle"). 3sort approaches N//3 temp 207 slots twice, but the run lengths that remain after 3 random exchanges 208 clearly has very high variance. 209 210 211A detailed description of timsort follows. 212 213Runs 214---- 215count_run() returns the # of elements in the next run, and, if it's a 216descending run, reverses it in-place. A run is either "ascending", which 217means non-decreasing: 218 219 a0 <= a1 <= a2 <= ... 220 221or "descending", which means non-increasing: 222 223 a0 >= a1 >= a2 >= ... 224 225Note that a run is always at least 2 long, unless we start at the array's 226last element. If all elements in the array are equal, it can be viewed as 227both ascending and descending. Upon return, the run count_run() identifies 228is always ascending. 229 230Reversal is done via the obvious fast "swap elements starting at each 231end, and converge at the middle" method. That can violate stability if 232the slice contains any equal elements. For that reason, for a long time 233the code used strict inequality (">" rather than ">=") in its definition 234of descending. 235 236Removing that restriction required some complication: when processing a 237descending run, all-equal sub-runs of elements are reversed in-place, on the 238fly. Their original relative order is restored "by magic" via the final 239"reverse the entire run" step. 240 241This makes processing descending runs a little more costly. We only use 242`__lt__` comparisons, so that `x == y` has to be deduced from 243`not x < y and not y < x`. But so long as a run remains strictly decreasing, 244only one of those compares needs to be done per loop iteration. So the primsry 245extra cost is paid only when there are equal elements, and they get some 246compensating benefit by not needing to end the descending run. 247 248There's one more trick added since the original: after reversing a descending 249run, it's possible that it can be extended by an adjacent ascending run. For 250example, given [3, 2, 1, 3, 4, 5, 0], the 3-element descending prefix is 251reversed in-place, and then extended by [3, 4, 5]. 252 253If an array is random, it's very unlikely we'll see long runs. If a natural 254run contains less than minrun elements (see next section), the main loop 255artificially boosts it to minrun elements, via a stable binary insertion sort 256applied to the right number of array elements following the short natural 257run. In a random array, *all* runs are likely to be minrun long as a 258result. This has two primary good effects: 259 2601. Random data strongly tends then toward perfectly balanced (both runs have 261 the same length) merges, which is the most efficient way to proceed when 262 data is random. 263 2642. Because runs are never very short, the rest of the code doesn't make 265 heroic efforts to shave a few cycles off per-merge overheads. For 266 example, reasonable use of function calls is made, rather than trying to 267 inline everything. Since there are no more than N/minrun runs to begin 268 with, a few "extra" function calls per merge is barely measurable. 269 270 271Computing minrun 272---------------- 273If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the 274whole array then; it's hard to beat that given the overheads of trying 275something fancier (see note BINSORT). 276 277When N is a power of 2, testing on random data showed that minrun values of 27816, 32, 64 and 128 worked about equally well. At 256 the data-movement cost 279in binary insertion sort clearly hurt, and at 8 the increase in the number 280of function calls clearly hurt. Picking *some* power of 2 is important 281here, so that the merges end up perfectly balanced (see next section). We 282pick 32 as a good value in the sweet range; picking a value at the low end 283allows the adaptive gimmicks more opportunity to exploit shorter natural 284runs. 285 286Because sortperf.py only tries powers of 2, it took a long time to notice 287that 32 isn't a good choice for the general case! Consider N=2112: 288 289>>> divmod(2112, 32) 290(66, 0) 291>>> 292 293If the data is randomly ordered, we're very likely to end up with 66 runs 294each of length 32. The first 64 of these trigger a sequence of perfectly 295balanced merges (see next section), leaving runs of lengths 2048 and 64 to 296merge at the end. The adaptive gimmicks can do that with fewer than 2048+64 297compares, but it's still more compares than necessary, and-- mergesort's 298bugaboo relative to samplesort --a lot more data movement (O(N) copies just 299to get 64 elements into place). 300 301If we take minrun=33 in this case, then we're very likely to end up with 64 302runs each of length 33, and then all merges are perfectly balanced. Better! 303 304What we want to avoid is picking minrun such that in 305 306 q, r = divmod(N, minrun) 307 308q is a power of 2 and r>0 (then the last merge only gets r elements into 309place, and r < minrun is small compared to N), or q a little larger than a 310power of 2 regardless of r (then we've got a case similar to "2112", again 311leaving too little work for the last merge to do). 312 313Instead we pick a minrun in range(MAX_MINRUN / 2, MAX_MINRUN + 1) such that 314N/minrun is exactly a power of 2, or if that isn't possible, is close to, but 315strictly less than, a power of 2. This is easier to do than it may sound: 316take the first log2(MAX_MINRUN) bits of N, and add 1 if any of the remaining 317bits are set. In fact, that rule covers every case in this section, including 318small N and exact powers of 2; merge_compute_minrun() is a deceptively simple 319function. 320 321 322The Merge Pattern 323----------------- 324In order to exploit regularities in the data, we're merging on natural 325run lengths, and they can become wildly unbalanced. That's a Good Thing 326for this sort! It means we have to find a way to manage an assortment of 327potentially very different run lengths, though. 328 329Stability constrains permissible merging patterns. For example, if we have 3303 consecutive runs of lengths 331 332 A:10000 B:20000 C:10000 333 334we dare not merge A with C first, because if A, B and C happen to contain 335a common element, it would get out of order wrt its occurrence(s) in B. The 336merging must be done as (A+B)+C or A+(B+C) instead. 337 338So merging is always done on two consecutive runs at a time, and in-place, 339although this may require some temp memory (more on that later). 340 341When a run is identified, its length is passed to found_new_run() to 342potentially merge runs on a stack of pending runs. We would like to delay 343merging as long as possible in order to exploit patterns that may come up 344later, but we like even more to do merging as soon as possible to exploit 345that the run just found is still high in the memory hierarchy. We also can't 346delay merging "too long" because it consumes memory to remember the runs that 347are still unmerged, and the stack has a fixed size. 348 349The original version of this code used the first thing I made up that didn't 350obviously suck ;-) It was loosely based on invariants involving the Fibonacci 351sequence. 352 353It worked OK, but it was hard to reason about, and was subtle enough that the 354intended invariants weren't actually preserved. Researchers discovered that 355when trying to complete a computer-generated correctness proof. That was 356easily-enough repaired, but the discovery spurred quite a bit of academic 357interest in truly good ways to manage incremental merging on the fly. 358 359At least a dozen different approaches were developed, some provably having 360near-optimal worst case behavior with respect to the entropy of the 361distribution of run lengths. Some details can be found in bpo-34561. 362 363The code now uses the "powersort" merge strategy from: 364 365 "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods 366 That Optimally Adapt to Existing Runs" 367 J. Ian Munro and Sebastian Wild 368 369The code is pretty simple, but the justification is quite involved, as it's 370based on fast approximations to optimal binary search trees, which are 371substantial topics on their own. 372 373Here we'll just cover some pragmatic details: 374 375The `powerloop()` function computes a run's "power". Say two adjacent runs 376begin at index s1. The first run has length n1, and the second run (starting 377at index s1+n1, called "s2" below) has length n2. The list has total length n. 378The "power" of the first run is a small integer, the depth of the node 379connecting the two runs in an ideal binary merge tree, where power 1 is the 380root node, and the power increases by 1 for each level deeper in the tree. 381 382The power is the least integer L such that the "midpoint interval" contains 383a rational number of the form J/2**L. The midpoint interval is the semi- 384closed interval: 385 386 ((s1 + n1/2)/n, (s2 + n2/2)/n] 387 388Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and 389(s2 + n2/2)/n are computed to infinite precision in binary, the power L is 390the first position at which the 2**-L bit differs between the expansions. 391Since the left end of the interval is less than the right end, the first 392differing bit must be a 0 bit in the left quotient and a 1 bit in the right 393quotient. 394 395`powerloop()` emulates these divisions, 1 bit at a time, using comparisons, 396subtractions, and shifts in a loop. 397 398You'll notice the paper uses an O(1) method instead, but that relies on two 399things we don't have: 400 401- An O(1) "count leading zeroes" primitive. We can find such a thing as a C 402 extension on most platforms, but not all, and there's no uniform spelling 403 on the platforms that support it. 404 405- Integer division on an integer type twice as wide as needed to hold the 406 list length. But the latter is Py_ssize_t for us, and is typically the 407 widest native signed integer type the platform supports. 408 409But since runs in our algorithm are almost never very short, the once-per-run 410overhead of `powerloop()` seems lost in the noise. 411 412Detail: why is Py_ssize_t "wide enough" in `powerloop()`? We do, after all, 413shift integers of that width left by 1. How do we know that won't spill into 414the sign bit? The trick is that we have some slop. `n` (the total list 415length) is the number of list elements, which is at most 4 times (on a 32-box, 416with 4-byte pointers) smaller than than the largest size_t. So at least the 417leading two bits of the integers we're using are clear. 418 419Since we can't compute a run's power before seeing the run that follows it, 420the most-recently identified run is never merged by `found_new_run()`. 421Instead a new run is only used to compute the 2nd-most-recent run's power. 422Then adjacent runs are merged so long as their saved power (tree depth) is 423greater than that newly computed power. When found_new_run() returns, only 424then is a new run pushed on to the stack of pending runs. 425 426A key invariant is that powers on the run stack are strictly decreasing 427(starting from the run at the top of the stack). 428 429Note that even powersort's strategy isn't always truly optimal. It can't be. 430Computing an optimal merge sequence can be done in time quadratic in the 431number of runs, which is very much slower, and also requires finding & 432remembering _all_ the runs' lengths (of which there may be billions) in 433advance. It's remarkable, though, how close to optimal this strategy gets. 434 435Curious factoid: of all the alternatives I've seen in the literature, 436powersort's is the only one that's always truly optimal for a collection of 3 437run lengths (for three lengths A B C, it's always optimal to first merge the 438shorter of A and C with B). 439 440 441Merge Memory 442------------ 443Merging adjacent runs of lengths A and B in-place, and in linear time, is 444difficult. Theoretical constructions are known that can do it, but they're 445too difficult and slow for practical use. But if we have temp memory equal 446to min(A, B), it's easy. 447 448If A is smaller (function merge_lo), copy A to a temp array, leave B alone, 449and then we can do the obvious merge algorithm left to right, from the temp 450area and B, starting the stores into where A used to live. There's always a 451free area in the original area comprising a number of elements equal to the 452number not yet merged from the temp array (trivially true at the start; 453proceed by induction). The only tricky bit is that if a comparison raises an 454exception, we have to remember to copy the remaining elements back in from 455the temp area, lest the array end up with duplicate entries from B. But 456that's exactly the same thing we need to do if we reach the end of B first, 457so the exit code is pleasantly common to both the normal and error cases. 458 459If B is smaller (function merge_hi, which is merge_lo's "mirror image"), 460much the same, except that we need to merge right to left, copying B into a 461temp array and starting the stores at the right end of where B used to live. 462 463A refinement: When we're about to merge adjacent runs A and B, we first do 464a form of binary search (more on that later) to see where B[0] should end up 465in A. Elements in A preceding that point are already in their final 466positions, effectively shrinking the size of A. Likewise we also search to 467see where A[-1] should end up in B, and elements of B after that point can 468also be ignored. This cuts the amount of temp memory needed by the same 469amount. 470 471These preliminary searches may not pay off, and can be expected *not* to 472repay their cost if the data is random. But they can win huge in all of 473time, copying, and memory savings when they do pay, so this is one of the 474"per-merge overheads" mentioned above that we're happy to endure because 475there is at most one very short run. It's generally true in this algorithm 476that we're willing to gamble a little to win a lot, even though the net 477expectation is negative for random data. 478 479 480Merge Algorithms 481---------------- 482merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo 483deals with runs where A <= B, and merge_hi where A > B. They don't know 484whether the data is clustered or uniform, but a lovely thing about merging 485is that many kinds of clustering "reveal themselves" by how many times in a 486row the winning merge element comes from the same run. We'll only discuss 487merge_lo here; merge_hi is exactly analogous. 488 489Merging begins in the usual, obvious way, comparing the first element of A 490to the first of B, and moving B[0] to the merge area if it's less than A[0], 491else moving A[0] to the merge area. Call that the "one pair at a time" 492mode. The only twist here is keeping track of how many times in a row "the 493winner" comes from the same run. 494 495If that count reaches MIN_GALLOP, we switch to "galloping mode". Here 496we *search* B for where A[0] belongs, and move over all the B's before 497that point in one chunk to the merge area, then move A[0] to the merge 498area. Then we search A for where B[0] belongs, and similarly move a 499slice of A in one chunk. Then back to searching B for where A[0] belongs, 500etc. We stay in galloping mode until both searches find slices to copy 501less than MIN_GALLOP elements long, at which point we go back to one-pair- 502at-a-time mode. 503 504A refinement: The MergeState struct contains the value of min_gallop that 505controls when we enter galloping mode, initialized to MIN_GALLOP. 506merge_lo() and merge_hi() adjust this higher when galloping isn't paying 507off, and lower when it is. 508 509 510Galloping 511--------- 512Still without loss of generality, assume A is the shorter run. In galloping 513mode, we first look for A[0] in B. We do this via "galloping", comparing 514A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding 515the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most 516roughly lg(B) comparisons, and, unlike a straight binary search, favors 517finding the right spot early in B (more on that later). 518 519After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 520consecutive elements, and a straight binary search requires exactly k-1 521additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we 522copy all the B's up to that point in one chunk, and then copy A[0]. Note 523that no matter where A[0] belongs in B, the combination of galloping + binary 524search finds it in no more than about 2*lg(B) comparisons. 525 526If we did a straight binary search, we could find it in no more than 527ceiling(lg(B+1)) comparisons -- but straight binary search takes that many 528comparisons no matter where A[0] belongs. Straight binary search thus loses 529to galloping unless the run is quite long, and we simply can't guess 530whether it is in advance. 531 532If data is random and runs have the same length, A[0] belongs at B[0] half 533the time, at B[1] a quarter of the time, and so on: a consecutive winning 534sub-run in B of length k occurs with probability 1/2**(k+1). So long 535winning sub-runs are extremely unlikely in random data, and guessing that a 536winning sub-run is going to be long is a dangerous game. 537 538OTOH, if data is lopsided or lumpy or contains many duplicates, long 539stretches of winning sub-runs are very likely, and cutting the number of 540comparisons needed to find one from O(B) to O(log B) is a huge win. 541 542Galloping compromises by getting out fast if there isn't a long winning 543sub-run, yet finding such very efficiently when they exist. 544 545I first learned about the galloping strategy in a related context; see: 546 547 "Adaptive Set Intersections, Unions, and Differences" (2000) 548 Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro 549 550and its followup(s). An earlier paper called the same strategy 551"exponential search": 552 553 "Optimistic Sorting and Information Theoretic Complexity" 554 Peter McIlroy 555 SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp 556 467-474, Austin, Texas, 25-27 January 1993. 557 558and it probably dates back to an earlier paper by Bentley and Yao. The 559McIlroy paper in particular has good analysis of a mergesort that's 560probably strongly related to this one in its galloping strategy. 561 562 563Galloping with a Broken Leg 564--------------------------- 565So why don't we always gallop? Because it can lose, on two counts: 566 5671. While we're willing to endure small per-merge overheads, per-comparison 568 overheads are a different story. Calling Yet Another Function per 569 comparison is expensive, and gallop_left() and gallop_right() are 570 too long-winded for sane inlining. 571 5722. Galloping can-- alas --require more comparisons than linear one-at-time 573 search, depending on the data. 574 575#2 requires details. If A[0] belongs before B[0], galloping requires 1 576compare to determine that, same as linear search, except it costs more 577to call the gallop function. If A[0] belongs right before B[1], galloping 578requires 2 compares, again same as linear search. On the third compare, 579galloping checks A[0] against B[3], and if it's <=, requires one more 580compare to determine whether A[0] belongs at B[2] or B[3]. That's a total 581of 4 compares, but if A[0] does belong at B[2], linear search would have 582discovered that in only 3 compares, and that's a huge loss! Really. It's 583an increase of 33% in the number of compares needed, and comparisons are 584expensive in Python. 585 586index in B where # compares linear # gallop # binary gallop 587A[0] belongs search needs compares compares total 588---------------- ----------------- -------- -------- ------ 589 0 1 1 0 1 590 591 1 2 2 0 2 592 593 2 3 3 1 4 594 3 4 3 1 4 595 596 4 5 4 2 6 597 5 6 4 2 6 598 6 7 4 2 6 599 7 8 4 2 6 600 601 8 9 5 3 8 602 9 10 5 3 8 603 10 11 5 3 8 604 11 12 5 3 8 605 ... 606 607In general, if A[0] belongs at B[i], linear search requires i+1 comparisons 608to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons. 609The advantage of galloping is unbounded as i grows, but it doesn't win at 610all until i=6. Before then, it loses twice (at i=2 and i=4), and ties 611at the other values. At and after i=6, galloping always wins. 612 613We can't guess in advance when it's going to win, though, so we do one pair 614at a time until the evidence seems strong that galloping may pay. MIN_GALLOP 615is 7, and that's pretty strong evidence. However, if the data is random, it 616simply will trigger galloping mode purely by luck every now and again, and 617it's quite likely to hit one of the losing cases next. On the other hand, 618in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it 619"should be" then. So the MergeState struct keeps a min_gallop variable 620that merge_lo and merge_hi adjust: the longer we stay in galloping mode, 621the smaller min_gallop gets, making it easier to transition back to 622galloping mode (if we ever leave it in the current merge, and at the 623start of the next merge). But whenever the gallop loop doesn't pay, 624min_gallop is increased by one, making it harder to transition back 625to galloping mode (and again both within a merge and across merges). For 626random data, this all but eliminates the gallop penalty: min_gallop grows 627large enough that we almost never get into galloping mode. And for cases 628like ~sort, min_gallop can fall to as low as 1. This seems to work well, 629but in all it's a minor improvement over using a fixed MIN_GALLOP value. 630 631 632Galloping Complication 633---------------------- 634The description above was for merge_lo. merge_hi has to merge "from the 635other end", and really needs to gallop starting at the last element in a run 636instead of the first. Galloping from the first still works, but does more 637comparisons than it should (this is significant -- I timed it both ways). For 638this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT) 639functions have a "hint" argument, which is the index at which galloping 640should begin. So galloping can actually start at any index, and proceed at 641offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index. 642 643In the code as I type it's always called with either 0 or n-1 (where n is 644the # of elements in a run). It's tempting to try to do something fancier, 645melding galloping with some form of interpolation search; for example, if 646we're merging a run of length 1 with a run of length 10000, index 5000 is 647probably a better guess at the final result than either 0 or 9999. But 648it's unclear how to generalize that intuition usefully, and merging of 649wildly unbalanced runs already enjoys excellent performance. 650 651~sort is a good example of when balanced runs could benefit from a better 652hint value: to the extent possible, this would like to use a starting 653offset equal to the previous value of acount/bcount. Doing so saves about 65410% of the compares in ~sort. However, doing so is also a mixed bag, 655hurting other cases. 656 657 658Comparing Average # of Compares on Random Arrays 659------------------------------------------------ 660[NOTE: This was done when the new algorithm used about 0.1% more compares 661 on random data than does its current incarnation.] 662 663Here list.sort() is samplesort, and list.msort() this sort: 664 665""" 666import random 667from time import clock as now 668 669def fill(n): 670 from random import random 671 return [random() for i in range(n)] 672 673def mycmp(x, y): 674 global ncmp 675 ncmp += 1 676 return cmp(x, y) 677 678def timeit(values, method): 679 global ncmp 680 X = values[:] 681 bound = getattr(X, method) 682 ncmp = 0 683 t1 = now() 684 bound(mycmp) 685 t2 = now() 686 return t2-t1, ncmp 687 688format = "%5s %9.2f %11d" 689f2 = "%5s %9.2f %11.2f" 690 691def drive(): 692 count = sst = sscmp = mst = mscmp = nelts = 0 693 while True: 694 n = random.randrange(100000) 695 nelts += n 696 x = fill(n) 697 698 t, c = timeit(x, 'sort') 699 sst += t 700 sscmp += c 701 702 t, c = timeit(x, 'msort') 703 mst += t 704 mscmp += c 705 706 count += 1 707 if count % 10: 708 continue 709 710 print "count", count, "nelts", nelts 711 print format % ("sort", sst, sscmp) 712 print format % ("msort", mst, mscmp) 713 print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp) 714 715drive() 716""" 717 718I ran this on Windows and kept using the computer lightly while it was 719running. time.clock() is wall-clock time on Windows, with better than 720microsecond resolution. samplesort started with a 1.52% #-of-comparisons 721disadvantage, fell quickly to 1.48%, and then fluctuated within that small 722range. Here's the last chunk of output before I killed the job: 723 724count 2630 nelts 130906543 725 sort 6110.80 1937887573 726msort 6002.78 1909389381 727 1.80 1.49 728 729We've done nearly 2 billion comparisons apiece at Python speed there, and 730that's enough <wink>. 731 732For random arrays of size 2 (yes, there are only 2 interesting ones), 733samplesort has a 50%(!) comparison disadvantage. This is a consequence of 734samplesort special-casing at most one ascending run at the start, then 735falling back to the general case if it doesn't find an ascending run 736immediately. The consequence is that it ends up using two compares to sort 737[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be 738taught how to deal with mixtures of ascending and descending runs 739efficiently in all cases. 740 741 742NOTES 743----- 744 745BINSORT 746A "binary insertion sort" is just like a textbook insertion sort, but instead 747of locating the correct position of the next item via linear (one at a time) 748search, an equivalent to Python's bisect.bisect_right is used to find the 749correct position in logarithmic time. Most texts don't mention this 750variation, and those that do usually say it's not worth the bother: insertion 751sort remains quadratic (expected and worst cases) either way. Speeding the 752search doesn't reduce the quadratic data movement costs. 753 754But in CPython's case, comparisons are extraordinarily expensive compared to 755moving data, and the details matter. Moving objects is just copying 756pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary 757user-supplied Python code), but even in simple cases (like 3 < 4) _all_ 758decisions are made at runtime: what's the type of the left comparand? the 759type of the right? do they need to be coerced to a common type? where's the 760code to compare these types? And so on. Even the simplest Python comparison 761triggers a large pile of C-level pointer dereferences, conditionals, and 762function calls. 763 764So cutting the number of compares is almost always measurably helpful in 765CPython, and the savings swamp the quadratic-time data movement costs for 766reasonable minrun values. 767 768 769LEFT OR RIGHT 770gallop_left() and gallop_right() are akin to the Python bisect module's 771bisect_left() and bisect_right(): they're the same unless the slice they're 772searching contains a (at least one) value equal to the value being searched 773for. In that case, gallop_left() returns the position immediately before the 774leftmost equal value, and gallop_right() the position immediately after the 775rightmost equal value. The distinction is needed to preserve stability. In 776general, when merging adjacent runs A and B, gallop_left is used to search 777thru B for where an element from A belongs, and gallop_right to search thru A 778for where an element from B belongs. 779 780 781REGION OF UNCERTAINTY 782Two kinds of confusion seem to be common about the claim that after finding 783a k such that 784 785 B[2**(k-1) - 1] < A[0] <= B[2**k - 1] 786 787then a binary search requires exactly k-1 tries to find A[0]'s proper 788location. For concreteness, say k=3, so B[3] < A[0] <= B[7]. 789 790The first confusion takes the form "OK, then the region of uncertainty is at 791indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 = 7923"; or the region is viewed as a Python slice and the objection is "but that's 793the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already 794compared A[0] against B[3] and against B[7], so A[0]'s correct location is 795already known wrt _both_ endpoints. What remains is to find A[0]'s correct 796location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the 797slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1 798inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has 799 (2**k-1)-1 - 2**(k-1) + 1 = 800 2**k-1 - 2**(k-1) = 801 2*2**(k-1)-1 - 2**(k-1) = 802 (2-1)*2**(k-1) - 1 = 803 2**(k-1) - 1 804elements. 805 806The second confusion: "k-1 = 2 binary searches can find the correct location 807among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we 808could make this more efficient by arranging for the region of uncertainty to 809span 2**(k-1) elements." Resolution: that confuses "elements" with 810"locations". In a slice with N elements, there are N+1 _locations_. In the 811example, with the region of uncertainty B[4], B[5], B[6], there are 4 812locations: before B[4], between B[4] and B[5], between B[5] and B[6], and 813after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1) 814locations. That's why k-1 binary searches are necessary and sufficient. 815 816OPTIMIZATION OF INDIVIDUAL COMPARISONS 817As noted above, even the simplest Python comparison triggers a large pile of 818C-level pointer dereferences, conditionals, and function calls. This can be 819partially mitigated by pre-scanning the data to determine whether the data is 820homogeneous with respect to type. If so, it is sometimes possible to 821substitute faster type-specific comparisons for the slower, generic 822PyObject_RichCompareBool. 823