• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1Intro
2-----
3This describes an adaptive, stable, natural mergesort, modestly called
4timsort (hey, I earned it <wink>).  It has supernatural performance on many
5kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
6as few as N-1), yet as fast as Python's previous highly tuned samplesort
7hybrid on random arrays.
8
9In a nutshell, the main routine marches over the array once, left to right,
10alternately identifying the next run, then merging it into the previous
11runs "intelligently".  Everything else is complication for speed, and some
12hard-won measure of memory efficiency.
13
14
15Comparison with Python's Samplesort Hybrid
16------------------------------------------
17+ timsort can require a temp array containing as many as N//2 pointers,
18  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
19  expected to require a temp array this large when sorting random data; on
20  data with significant structure, it may get away without using any extra
21  heap memory.  This appears to be the strongest argument against it, but
22  compared to the size of an object, 2 temp bytes worst-case (also expected-
23  case for random data) doesn't scare me much.
24
25  It turns out that Perl is moving to a stable mergesort, and the code for
26  that appears always to require a temp array with room for at least N
27  pointers. (Note that I wouldn't want to do that even if space weren't an
28  issue; I believe its efforts at memory frugality also save timsort
29  significant pointer-copying costs, and allow it to have a smaller working
30  set.)
31
32+ Across about four hours of generating random arrays, and sorting them
33  under both methods, samplesort required about 1.5% more comparisons
34  (the program is at the end of this file).
35
36+ In real life, this may be faster or slower on random arrays than
37  samplesort was, depending on platform quirks.  Since it does fewer
38  comparisons on average, it can be expected to do better the more
39  expensive a comparison function is.  OTOH, it does more data movement
40  (pointer copying) than samplesort, and that may negate its small
41  comparison advantage (depending on platform quirks) unless comparison
42  is very expensive.
43
44+ On arrays with many kinds of pre-existing order, this blows samplesort out
45  of the water.  It's significantly faster than samplesort even on some
46  cases samplesort was special-casing the snot out of.  I believe that lists
47  very often do have exploitable partial order in real life, and this is the
48  strongest argument in favor of timsort (indeed, samplesort's special cases
49  for extreme partial order are appreciated by real users, and timsort goes
50  much deeper than those, in particular naturally covering every case where
51  someone has suggested "and it would be cool if list.sort() had a special
52  case for this too ... and for that ...").
53
54+ Here are exact comparison counts across all the tests in sortperf.py,
55  when run with arguments "15 20 1".
56
57  Column Key:
58      *sort: random data
59      \sort: descending data
60      /sort: ascending data
61      3sort: ascending, then 3 random exchanges
62      +sort: ascending, then 10 random at the end
63      %sort: ascending, then randomly replace 1% of elements w/ random values
64      ~sort: many duplicates
65      =sort: all equal
66      !sort: worst case scenario
67
68  First the trivial cases, trivial for samplesort because it special-cased
69  them, and trivial for timsort because it naturally works on runs.  Within
70  an "n" block, the first line gives the # of compares done by samplesort,
71  the second line by timsort, and the third line is the percentage by
72  which the samplesort count exceeds the timsort count:
73
74      n   \sort   /sort   =sort
75-------  ------  ------  ------
76  32768   32768   32767   32767  samplesort
77          32767   32767   32767  timsort
78          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort
79
80  65536   65536   65535   65535
81          65535   65535   65535
82          0.00%   0.00%   0.00%
83
84 131072  131072  131071  131071
85         131071  131071  131071
86          0.00%   0.00%   0.00%
87
88 262144  262144  262143  262143
89         262143  262143  262143
90          0.00%   0.00%   0.00%
91
92 524288  524288  524287  524287
93         524287  524287  524287
94          0.00%   0.00%   0.00%
95
961048576 1048576 1048575 1048575
97        1048575 1048575 1048575
98          0.00%   0.00%   0.00%
99
100  The algorithms are effectively identical in these cases, except that
101  timsort does one less compare in \sort.
102
103  Now for the more interesting cases.  Where lg(x) is the logarithm of x to
104  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
105  the best any comparison-based sorting algorithm can do on average (across
106  all permutations).  When a method gets significantly below that, it's
107  either astronomically lucky, or is finding exploitable structure in the
108  data.
109
110
111      n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
112-------  -------   ------   -------  -------  ------  -------  --------
113  32768   444255   453096   453614    32908   452871   130491    469141 old
114                   448885    33016    33007    50426   182083     65534 new
115                    0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new
116
117  65536   954037   972699   981940    65686   973104   260029   1004607
118                   962991    65821    65808   101667   364341    131070
119                    1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%
120
121 131072  2039137  2101881  2091491   131232  2092894   554790   2161379
122                  2057533   131410   131361   206193   728871    262142
123                    2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%
124
125 262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
126                  4377402   262437   262459   416347  1457945    524286
127                    1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%
128
129 524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
130                  9278734   524580   524633   837947  2916107   1048574
131                   1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%
132
1331048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
134                 19606028  1048958  1048941  1694896  5832445   2097150
135                    1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%
136
137  Discussion of cases:
138
139  *sort:  There's no structure in random data to exploit, so the theoretical
140  limit is lg(n!).  Both methods get close to that, and timsort is hugging
141  it (indeed, in a *marginal* sense, it's a spectacular improvement --
142  there's only about 1% left before hitting the wall, and timsort knows
143  darned well it's doing compares that won't pay on random data -- but so
144  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
145  quicksort does about 39% more compares than the limit, and the median-of-3
146  variant about 19% more.
147
148  3sort, %sort, and !sort:  No contest; there's structure in this data, but
149  not of the specific kinds samplesort special-cases.  Note that structure
150  in !sort wasn't put there on purpose -- it was crafted as a worst case for
151  a previous quicksort implementation.  That timsort nails it came as a
152  surprise to me (although it's obvious in retrospect).
153
154  +sort:  samplesort special-cases this data, and does a few less compares
155  than timsort.  However, timsort runs this case significantly faster on all
156  boxes we have timings for, because timsort is in the business of merging
157  runs efficiently, while samplesort does much more data movement in this
158  (for it) special case.
159
160  ~sort:  samplesort's special cases for large masses of equal elements are
161  extremely effective on ~sort's specific data pattern, and timsort just
162  isn't going to get close to that, despite that it's clearly getting a
163  great deal of benefit out of the duplicates (the # of compares is much less
164  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
165  distinct values, and as the distribution gets more skewed, samplesort's
166  equal-element gimmicks become less effective, while timsort's adaptive
167  strategies find more to exploit; in a database supplied by Kevin Altis, a
168  sort on its highly skewed "on which stock exchange does this company's
169  stock trade?" field ran over twice as fast under timsort.
170
171  However, despite that timsort does many more comparisons on ~sort, and
172  that on several platforms ~sort runs highly significantly slower under
173  timsort, on other platforms ~sort runs highly significantly faster under
174  timsort.  No other kind of data has shown this wild x-platform behavior,
175  and we don't have an explanation for it.  The only thing I can think of
176  that could transform what "should be" highly significant slowdowns into
177  highly significant speedups on some boxes are catastrophic cache effects
178  in samplesort.
179
180  But timsort "should be" slower than samplesort on ~sort, so it's hard
181  to count that it isn't on some boxes as a strike against it <wink>.
182
183+ Here's the highwater mark for the number of heap-based temp slots (4
184  bytes each on this box) needed by each test, again with arguments
185  "15 20 1":
186
187   2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
188  32768  16384     0     0   6256      0  10821  12288      0  16383
189  65536  32766     0     0  21652      0  31276  24576      0  32767
190 131072  65534     0     0  17258      0  58112  49152      0  65535
191 262144 131072     0     0  35660      0 123561  98304      0 131071
192 524288 262142     0     0  31302      0 212057 196608      0 262143
1931048576 524286     0     0 312438      0 484942 393216      0 524287
194
195  Discussion:  The tests that end up doing (close to) perfectly balanced
196  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
197  also ends up doing balanced merges, but systematically benefits a lot from
198  the preliminary pre-merge searches described under "Merge Memory" later.
199  %sort approaches having a balanced merge at the end because the random
200  selection of elements to replace is expected to produce an out-of-order
201  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
202  cases, needing no merging at all.  +sort ends up having one very long run
203  and one very short, and so gets all the temp space it needs from the small
204  temparray member of the MergeState struct (note that the same would be
205  true if the new random elements were prefixed to the sorted list instead,
206  but not if they appeared "in the middle").  3sort approaches N//3 temp
207  slots twice, but the run lengths that remain after 3 random exchanges
208  clearly has very high variance.
209
210
211A detailed description of timsort follows.
212
213Runs
214----
215count_run() returns the # of elements in the next run, and, if it's a
216descending run, reverses it in-place. A run is either "ascending", which
217means non-decreasing:
218
219    a0 <= a1 <= a2 <= ...
220
221or "descending", which means non-increasing:
222
223    a0 >= a1 >= a2 >= ...
224
225Note that a run is always at least 2 long, unless we start at the array's
226last element. If all elements in the array are equal, it can be viewed as
227both ascending and descending. Upon return, the run count_run() identifies
228is always ascending.
229
230Reversal is done via the obvious fast "swap elements starting at each
231end, and converge at the middle" method. That can violate stability if
232the slice contains any equal elements. For that reason, for a long time
233the code used strict inequality (">" rather than ">=") in its definition
234of descending.
235
236Removing that restriction required some complication: when processing a
237descending run, all-equal sub-runs of elements are reversed in-place, on the
238fly. Their original relative order is restored "by magic" via the final
239"reverse the entire run" step.
240
241This makes processing descending runs a little more costly. We only use
242`__lt__` comparisons, so that `x == y` has to be deduced from
243`not x < y and not y < x`. But so long as a run remains strictly decreasing,
244only one of those compares needs to be done per loop iteration. So the primsry
245extra cost is paid only when there are equal elements, and they get some
246compensating benefit by not needing to end the descending run.
247
248There's one more trick added since the original: after reversing a descending
249run, it's possible that it can be extended by an adjacent ascending run. For
250example, given [3, 2, 1, 3, 4, 5, 0], the 3-element descending prefix is
251reversed in-place, and then extended by [3, 4, 5].
252
253If an array is random, it's very unlikely we'll see long runs.  If a natural
254run contains less than minrun elements (see next section), the main loop
255artificially boosts it to minrun elements, via a stable binary insertion sort
256applied to the right number of array elements following the short natural
257run.  In a random array, *all* runs are likely to be minrun long as a
258result.  This has two primary good effects:
259
2601. Random data strongly tends then toward perfectly balanced (both runs have
261   the same length) merges, which is the most efficient way to proceed when
262   data is random.
263
2642. Because runs are never very short, the rest of the code doesn't make
265   heroic efforts to shave a few cycles off per-merge overheads.  For
266   example, reasonable use of function calls is made, rather than trying to
267   inline everything.  Since there are no more than N/minrun runs to begin
268   with, a few "extra" function calls per merge is barely measurable.
269
270
271Computing minrun
272----------------
273If N < MAX_MINRUN, minrun is N.  IOW, binary insertion sort is used for the
274whole array then; it's hard to beat that given the overheads of trying
275something fancier (see note BINSORT).
276
277When N is a power of 2, testing on random data showed that minrun values of
27816, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
279in binary insertion sort clearly hurt, and at 8 the increase in the number
280of function calls clearly hurt.  Picking *some* power of 2 is important
281here, so that the merges end up perfectly balanced (see next section).  We
282pick 32 as a good value in the sweet range; picking a value at the low end
283allows the adaptive gimmicks more opportunity to exploit shorter natural
284runs.
285
286Because sortperf.py only tries powers of 2, it took a long time to notice
287that 32 isn't a good choice for the general case!  Consider N=2112:
288
289>>> divmod(2112, 32)
290(66, 0)
291>>>
292
293If the data is randomly ordered, we're very likely to end up with 66 runs
294each of length 32.  The first 64 of these trigger a sequence of perfectly
295balanced merges (see next section), leaving runs of lengths 2048 and 64 to
296merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
297compares, but it's still more compares than necessary, and-- mergesort's
298bugaboo relative to samplesort --a lot more data movement (O(N) copies just
299to get 64 elements into place).
300
301If we take minrun=33 in this case, then we're very likely to end up with 64
302runs each of length 33, and then all merges are perfectly balanced.  Better!
303
304What we want to avoid is picking minrun such that in
305
306    q, r = divmod(N, minrun)
307
308q is a power of 2 and r>0 (then the last merge only gets r elements into
309place, and r < minrun is small compared to N), or q a little larger than a
310power of 2 regardless of r (then we've got a case similar to "2112", again
311leaving too little work for the last merge to do).
312
313Instead we pick a minrun in range(MAX_MINRUN / 2, MAX_MINRUN + 1) such that
314N/minrun is exactly a power of 2, or if that isn't possible, is close to, but
315strictly less than, a power of 2.  This is easier to do than it may sound:
316take the first log2(MAX_MINRUN) bits of N, and add 1 if any of the remaining
317bits are set. In fact, that rule covers every case in this section, including
318small N and exact powers of 2; merge_compute_minrun() is a deceptively simple
319function.
320
321
322The Merge Pattern
323-----------------
324In order to exploit regularities in the data, we're merging on natural
325run lengths, and they can become wildly unbalanced.  That's a Good Thing
326for this sort!  It means we have to find a way to manage an assortment of
327potentially very different run lengths, though.
328
329Stability constrains permissible merging patterns.  For example, if we have
3303 consecutive runs of lengths
331
332    A:10000  B:20000  C:10000
333
334we dare not merge A with C first, because if A, B and C happen to contain
335a common element, it would get out of order wrt its occurrence(s) in B.  The
336merging must be done as (A+B)+C or A+(B+C) instead.
337
338So merging is always done on two consecutive runs at a time, and in-place,
339although this may require some temp memory (more on that later).
340
341When a run is identified, its length is passed to found_new_run() to
342potentially merge runs on a stack of pending runs.  We would like to delay
343merging as long as possible in order to exploit patterns that may come up
344later, but we like even more to do merging as soon as possible to exploit
345that the run just found is still high in the memory hierarchy.  We also can't
346delay merging "too long" because it consumes memory to remember the runs that
347are still unmerged, and the stack has a fixed size.
348
349The original version of this code used the first thing I made up that didn't
350obviously suck ;-) It was loosely based on invariants involving the Fibonacci
351sequence.
352
353It worked OK, but it was hard to reason about, and was subtle enough that the
354intended invariants weren't actually preserved.  Researchers discovered that
355when trying to complete a computer-generated correctness proof.  That was
356easily-enough repaired, but the discovery spurred quite a bit of academic
357interest in truly good ways to manage incremental merging on the fly.
358
359At least a dozen different approaches were developed, some provably having
360near-optimal worst case behavior with respect to the entropy of the
361distribution of run lengths.  Some details can be found in bpo-34561.
362
363The code now uses the "powersort" merge strategy from:
364
365    "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
366     That Optimally Adapt to Existing Runs"
367    J. Ian Munro and Sebastian Wild
368
369The code is pretty simple, but the justification is quite involved, as it's
370based on fast approximations to optimal binary search trees, which are
371substantial topics on their own.
372
373Here we'll just cover some pragmatic details:
374
375The `powerloop()` function computes a run's "power". Say two adjacent runs
376begin at index s1. The first run has length n1, and the second run (starting
377at index s1+n1, called "s2" below) has length n2. The list has total length n.
378The "power" of the first run is a small integer, the depth of the node
379connecting the two runs in an ideal binary merge tree, where power 1 is the
380root node, and the power increases by 1 for each level deeper in the tree.
381
382The power is the least integer L such that the "midpoint interval" contains
383a rational number of the form J/2**L. The midpoint interval is the semi-
384closed interval:
385
386    ((s1 + n1/2)/n, (s2 + n2/2)/n]
387
388Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
389(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
390the first position at which the 2**-L bit differs between the expansions.
391Since the left end of the interval is less than the right end, the first
392differing bit must be a 0 bit in the left quotient and a 1 bit in the right
393quotient.
394
395`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
396subtractions, and shifts in a loop.
397
398You'll notice the paper uses an O(1) method instead, but that relies on two
399things we don't have:
400
401- An O(1) "count leading zeroes" primitive. We can find such a thing as a C
402  extension on most platforms, but not all, and there's no uniform spelling
403  on the platforms that support it.
404
405- Integer division on an integer type twice as wide as needed to hold the
406  list length. But the latter is Py_ssize_t for us, and is typically the
407  widest native signed integer type the platform supports.
408
409But since runs in our algorithm are almost never very short, the once-per-run
410overhead of `powerloop()` seems lost in the noise.
411
412Detail: why is Py_ssize_t "wide enough" in `powerloop()`?  We do, after all,
413shift integers of that width left by 1.  How do we know that won't spill into
414the sign bit?  The trick is that we have some slop. `n` (the total list
415length) is the number of list elements, which is at most 4 times (on a 32-box,
416with 4-byte pointers) smaller than than the largest size_t. So at least the
417leading two bits of the integers we're using are clear.
418
419Since we can't compute a run's power before seeing the run that follows it,
420the most-recently identified run is never merged by `found_new_run()`.
421Instead a new run is only used to compute the 2nd-most-recent run's power.
422Then adjacent runs are merged so long as their saved power (tree depth) is
423greater than that newly computed power. When found_new_run() returns, only
424then is a new run pushed on to the stack of pending runs.
425
426A key invariant is that powers on the run stack are strictly decreasing
427(starting from the run at the top of the stack).
428
429Note that even powersort's strategy isn't always truly optimal. It can't be.
430Computing an optimal merge sequence can be done in time quadratic in the
431number of runs, which is very much slower, and also requires finding &
432remembering _all_ the runs' lengths (of which there may be billions) in
433advance.  It's remarkable, though, how close to optimal this strategy gets.
434
435Curious factoid: of all the alternatives I've seen in the literature,
436powersort's is the only one that's always truly optimal for a collection of 3
437run lengths (for three lengths A B C, it's always optimal to first merge the
438shorter of A and C with B).
439
440
441Merge Memory
442------------
443Merging adjacent runs of lengths A and B in-place, and in linear time, is
444difficult.  Theoretical constructions are known that can do it, but they're
445too difficult and slow for practical use.  But if we have temp memory equal
446to min(A, B), it's easy.
447
448If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
449and then we can do the obvious merge algorithm left to right, from the temp
450area and B, starting the stores into where A used to live.  There's always a
451free area in the original area comprising a number of elements equal to the
452number not yet merged from the temp array (trivially true at the start;
453proceed by induction).  The only tricky bit is that if a comparison raises an
454exception, we have to remember to copy the remaining elements back in from
455the temp area, lest the array end up with duplicate entries from B.  But
456that's exactly the same thing we need to do if we reach the end of B first,
457so the exit code is pleasantly common to both the normal and error cases.
458
459If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
460much the same, except that we need to merge right to left, copying B into a
461temp array and starting the stores at the right end of where B used to live.
462
463A refinement:  When we're about to merge adjacent runs A and B, we first do
464a form of binary search (more on that later) to see where B[0] should end up
465in A.  Elements in A preceding that point are already in their final
466positions, effectively shrinking the size of A.  Likewise we also search to
467see where A[-1] should end up in B, and elements of B after that point can
468also be ignored.  This cuts the amount of temp memory needed by the same
469amount.
470
471These preliminary searches may not pay off, and can be expected *not* to
472repay their cost if the data is random.  But they can win huge in all of
473time, copying, and memory savings when they do pay, so this is one of the
474"per-merge overheads" mentioned above that we're happy to endure because
475there is at most one very short run.  It's generally true in this algorithm
476that we're willing to gamble a little to win a lot, even though the net
477expectation is negative for random data.
478
479
480Merge Algorithms
481----------------
482merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
483deals with runs where A <= B, and merge_hi where A > B.  They don't know
484whether the data is clustered or uniform, but a lovely thing about merging
485is that many kinds of clustering "reveal themselves" by how many times in a
486row the winning merge element comes from the same run.  We'll only discuss
487merge_lo here; merge_hi is exactly analogous.
488
489Merging begins in the usual, obvious way, comparing the first element of A
490to the first of B, and moving B[0] to the merge area if it's less than A[0],
491else moving A[0] to the merge area.  Call that the "one pair at a time"
492mode.  The only twist here is keeping track of how many times in a row "the
493winner" comes from the same run.
494
495If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
496we *search* B for where A[0] belongs, and move over all the B's before
497that point in one chunk to the merge area, then move A[0] to the merge
498area.  Then we search A for where B[0] belongs, and similarly move a
499slice of A in one chunk.  Then back to searching B for where A[0] belongs,
500etc.  We stay in galloping mode until both searches find slices to copy
501less than MIN_GALLOP elements long, at which point we go back to one-pair-
502at-a-time mode.
503
504A refinement:  The MergeState struct contains the value of min_gallop that
505controls when we enter galloping mode, initialized to MIN_GALLOP.
506merge_lo() and merge_hi() adjust this higher when galloping isn't paying
507off, and lower when it is.
508
509
510Galloping
511---------
512Still without loss of generality, assume A is the shorter run.  In galloping
513mode, we first look for A[0] in B.  We do this via "galloping", comparing
514A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
515the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
516roughly lg(B) comparisons, and, unlike a straight binary search, favors
517finding the right spot early in B (more on that later).
518
519After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
520consecutive elements, and a straight binary search requires exactly k-1
521additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
522copy all the B's up to that point in one chunk, and then copy A[0].  Note
523that no matter where A[0] belongs in B, the combination of galloping + binary
524search finds it in no more than about 2*lg(B) comparisons.
525
526If we did a straight binary search, we could find it in no more than
527ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
528comparisons no matter where A[0] belongs.  Straight binary search thus loses
529to galloping unless the run is quite long, and we simply can't guess
530whether it is in advance.
531
532If data is random and runs have the same length, A[0] belongs at B[0] half
533the time, at B[1] a quarter of the time, and so on:  a consecutive winning
534sub-run in B of length k occurs with probability 1/2**(k+1).  So long
535winning sub-runs are extremely unlikely in random data, and guessing that a
536winning sub-run is going to be long is a dangerous game.
537
538OTOH, if data is lopsided or lumpy or contains many duplicates, long
539stretches of winning sub-runs are very likely, and cutting the number of
540comparisons needed to find one from O(B) to O(log B) is a huge win.
541
542Galloping compromises by getting out fast if there isn't a long winning
543sub-run, yet finding such very efficiently when they exist.
544
545I first learned about the galloping strategy in a related context; see:
546
547    "Adaptive Set Intersections, Unions, and Differences" (2000)
548    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
549
550and its followup(s).  An earlier paper called the same strategy
551"exponential search":
552
553   "Optimistic Sorting and Information Theoretic Complexity"
554   Peter McIlroy
555   SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
556   467-474, Austin, Texas, 25-27 January 1993.
557
558and it probably dates back to an earlier paper by Bentley and Yao.  The
559McIlroy paper in particular has good analysis of a mergesort that's
560probably strongly related to this one in its galloping strategy.
561
562
563Galloping with a Broken Leg
564---------------------------
565So why don't we always gallop?  Because it can lose, on two counts:
566
5671. While we're willing to endure small per-merge overheads, per-comparison
568   overheads are a different story.  Calling Yet Another Function per
569   comparison is expensive, and gallop_left() and gallop_right() are
570   too long-winded for sane inlining.
571
5722. Galloping can-- alas --require more comparisons than linear one-at-time
573   search, depending on the data.
574
575#2 requires details.  If A[0] belongs before B[0], galloping requires 1
576compare to determine that, same as linear search, except it costs more
577to call the gallop function.  If A[0] belongs right before B[1], galloping
578requires 2 compares, again same as linear search.  On the third compare,
579galloping checks A[0] against B[3], and if it's <=, requires one more
580compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
581of 4 compares, but if A[0] does belong at B[2], linear search would have
582discovered that in only 3 compares, and that's a huge loss!  Really.  It's
583an increase of 33% in the number of compares needed, and comparisons are
584expensive in Python.
585
586index in B where    # compares linear  # gallop  # binary  gallop
587A[0] belongs        search needs       compares  compares  total
588----------------    -----------------  --------  --------  ------
589               0                    1         1         0       1
590
591               1                    2         2         0       2
592
593               2                    3         3         1       4
594               3                    4         3         1       4
595
596               4                    5         4         2       6
597               5                    6         4         2       6
598               6                    7         4         2       6
599               7                    8         4         2       6
600
601               8                    9         5         3       8
602               9                   10         5         3       8
603              10                   11         5         3       8
604              11                   12         5         3       8
605                                        ...
606
607In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
608to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
609The advantage of galloping is unbounded as i grows, but it doesn't win at
610all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
611at the other values.  At and after i=6, galloping always wins.
612
613We can't guess in advance when it's going to win, though, so we do one pair
614at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
615is 7, and that's pretty strong evidence.  However, if the data is random, it
616simply will trigger galloping mode purely by luck every now and again, and
617it's quite likely to hit one of the losing cases next.  On the other hand,
618in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
619"should be" then.  So the MergeState struct keeps a min_gallop variable
620that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
621the smaller min_gallop gets, making it easier to transition back to
622galloping mode (if we ever leave it in the current merge, and at the
623start of the next merge).  But whenever the gallop loop doesn't pay,
624min_gallop is increased by one, making it harder to transition back
625to galloping mode (and again both within a merge and across merges).  For
626random data, this all but eliminates the gallop penalty:  min_gallop grows
627large enough that we almost never get into galloping mode.  And for cases
628like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
629but in all it's a minor improvement over using a fixed MIN_GALLOP value.
630
631
632Galloping Complication
633----------------------
634The description above was for merge_lo.  merge_hi has to merge "from the
635other end", and really needs to gallop starting at the last element in a run
636instead of the first.  Galloping from the first still works, but does more
637comparisons than it should (this is significant -- I timed it both ways). For
638this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
639functions have a "hint" argument, which is the index at which galloping
640should begin.  So galloping can actually start at any index, and proceed at
641offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
642
643In the code as I type it's always called with either 0 or n-1 (where n is
644the # of elements in a run).  It's tempting to try to do something fancier,
645melding galloping with some form of interpolation search; for example, if
646we're merging a run of length 1 with a run of length 10000, index 5000 is
647probably a better guess at the final result than either 0 or 9999.  But
648it's unclear how to generalize that intuition usefully, and merging of
649wildly unbalanced runs already enjoys excellent performance.
650
651~sort is a good example of when balanced runs could benefit from a better
652hint value:  to the extent possible, this would like to use a starting
653offset equal to the previous value of acount/bcount.  Doing so saves about
65410% of the compares in ~sort.  However, doing so is also a mixed bag,
655hurting other cases.
656
657
658Comparing Average # of Compares on Random Arrays
659------------------------------------------------
660[NOTE:  This was done when the new algorithm used about 0.1% more compares
661 on random data than does its current incarnation.]
662
663Here list.sort() is samplesort, and list.msort() this sort:
664
665"""
666import random
667from time import clock as now
668
669def fill(n):
670    from random import random
671    return [random() for i in range(n)]
672
673def mycmp(x, y):
674    global ncmp
675    ncmp += 1
676    return cmp(x, y)
677
678def timeit(values, method):
679    global ncmp
680    X = values[:]
681    bound = getattr(X, method)
682    ncmp = 0
683    t1 = now()
684    bound(mycmp)
685    t2 = now()
686    return t2-t1, ncmp
687
688format = "%5s  %9.2f  %11d"
689f2     = "%5s  %9.2f  %11.2f"
690
691def drive():
692    count = sst = sscmp = mst = mscmp = nelts = 0
693    while True:
694        n = random.randrange(100000)
695        nelts += n
696        x = fill(n)
697
698        t, c = timeit(x, 'sort')
699        sst += t
700        sscmp += c
701
702        t, c = timeit(x, 'msort')
703        mst += t
704        mscmp += c
705
706        count += 1
707        if count % 10:
708            continue
709
710        print "count", count, "nelts", nelts
711        print format % ("sort",  sst, sscmp)
712        print format % ("msort", mst, mscmp)
713        print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)
714
715drive()
716"""
717
718I ran this on Windows and kept using the computer lightly while it was
719running.  time.clock() is wall-clock time on Windows, with better than
720microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
721disadvantage, fell quickly to 1.48%, and then fluctuated within that small
722range.  Here's the last chunk of output before I killed the job:
723
724count 2630 nelts 130906543
725 sort    6110.80   1937887573
726msort    6002.78   1909389381
727            1.80         1.49
728
729We've done nearly 2 billion comparisons apiece at Python speed there, and
730that's enough <wink>.
731
732For random arrays of size 2 (yes, there are only 2 interesting ones),
733samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
734samplesort special-casing at most one ascending run at the start, then
735falling back to the general case if it doesn't find an ascending run
736immediately.  The consequence is that it ends up using two compares to sort
737[2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
738taught how to deal with mixtures of ascending and descending runs
739efficiently in all cases.
740
741
742NOTES
743-----
744
745BINSORT
746A "binary insertion sort" is just like a textbook insertion sort, but instead
747of locating the correct position of the next item via linear (one at a time)
748search, an equivalent to Python's bisect.bisect_right is used to find the
749correct position in logarithmic time.  Most texts don't mention this
750variation, and those that do usually say it's not worth the bother:  insertion
751sort remains quadratic (expected and worst cases) either way.  Speeding the
752search doesn't reduce the quadratic data movement costs.
753
754But in CPython's case, comparisons are extraordinarily expensive compared to
755moving data, and the details matter.  Moving objects is just copying
756pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
757user-supplied Python code), but even in simple cases (like 3 < 4) _all_
758decisions are made at runtime:  what's the type of the left comparand?  the
759type of the right?  do they need to be coerced to a common type?  where's the
760code to compare these types?  And so on.  Even the simplest Python comparison
761triggers a large pile of C-level pointer dereferences, conditionals, and
762function calls.
763
764So cutting the number of compares is almost always measurably helpful in
765CPython, and the savings swamp the quadratic-time data movement costs for
766reasonable minrun values.
767
768
769LEFT OR RIGHT
770gallop_left() and gallop_right() are akin to the Python bisect module's
771bisect_left() and bisect_right():  they're the same unless the slice they're
772searching contains a (at least one) value equal to the value being searched
773for.  In that case, gallop_left() returns the position immediately before the
774leftmost equal value, and gallop_right() the position immediately after the
775rightmost equal value.  The distinction is needed to preserve stability.  In
776general, when merging adjacent runs A and B, gallop_left is used to search
777thru B for where an element from A belongs, and gallop_right to search thru A
778for where an element from B belongs.
779
780
781REGION OF UNCERTAINTY
782Two kinds of confusion seem to be common about the claim that after finding
783a k such that
784
785    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
786
787then a binary search requires exactly k-1 tries to find A[0]'s proper
788location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
789
790The first confusion takes the form "OK, then the region of uncertainty is at
791indices 3, 4, 5, 6 and 7:  that's 5 elements, not the claimed 2**(k-1) - 1 =
7923"; or the region is viewed as a Python slice and the objection is "but that's
793the slice B[3:7], so has 7-3 = 4 elements".  Resolution:  we've already
794compared A[0] against B[3] and against B[7], so A[0]'s correct location is
795already known wrt _both_ endpoints.  What remains is to find A[0]'s correct
796location wrt B[4], B[5] and B[6], which spans 3 elements.  Or in general, the
797slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
798inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
799    (2**k-1)-1 - 2**(k-1) + 1 =
800    2**k-1 - 2**(k-1) =
801    2*2**(k-1)-1 - 2**(k-1) =
802    (2-1)*2**(k-1) - 1 =
803    2**(k-1) - 1
804elements.
805
806The second confusion:  "k-1 = 2 binary searches can find the correct location
807among 2**(k-1) = 4 elements, but you're only applying it to 3 elements:  we
808could make this more efficient by arranging for the region of uncertainty to
809span 2**(k-1) elements."  Resolution:  that confuses "elements" with
810"locations".  In a slice with N elements, there are N+1 _locations_.  In the
811example, with the region of uncertainty B[4], B[5], B[6], there are 4
812locations:  before B[4], between B[4] and B[5], between B[5] and B[6], and
813after B[6].  In general, across 2**(k-1)-1 elements, there are 2**(k-1)
814locations.  That's why k-1 binary searches are necessary and sufficient.
815
816OPTIMIZATION OF INDIVIDUAL COMPARISONS
817As noted above, even the simplest Python comparison triggers a large pile of
818C-level pointer dereferences, conditionals, and function calls.  This can be
819partially mitigated by pre-scanning the data to determine whether the data is
820homogeneous with respect to type.  If so, it is sometimes possible to
821substitute faster type-specific comparisons for the slower, generic
822PyObject_RichCompareBool.
823