Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also
  expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one fewer compare in \sort.

  Now for the more interesting cases.  Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations).  When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.
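  For a concrete check, lg(n!) can be computed without forming the gigantic
  integer n!; a small Python sketch (the helper name is made up here):

```python
import math

def lg_factorial(n):
    # lg(n!) = log2(gamma(n + 1)); lgamma avoids building the huge
    # integer n! and works fine for n in the millions.
    return math.lgamma(n + 1) / math.log(2)
```

  For n = 32768 this gives about 444255, matching the lg(n!) column in the
  table.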


      n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
-------  -------   ------   -------  -------  ------  -------  --------
  32768   444255   453096   453614    32908   452871   130491    469141 old
                   448885    33016    33007    50426   182083     65534 new
                    0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new

  65536   954037   972699   981940    65686   973104   260029   1004607
                   962991    65821    65808   101667   364341    131070
                    1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%

 131072  2039137  2101881  2091491   131232  2092894   554790   2161379
                  2057533   131410   131361   206193   728871    262142
                    2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%

 262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
                  4377402   262437   262459   416347  1457945    524286
                    1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%

 524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
                  9278734   524580   524633   837947  2916107   1048574
                   1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%

1048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
                 19606028  1048958  1048941  1694896  5832445   2097150
                    1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%

  Discussion of cases:

  *sort:  There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort:  No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).
  +sort:  samplesort special-cases this data, and does a few fewer compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort:  samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

   2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
  32768  16384     0     0   6256      0  10821  12288      0  16383
  65536  32766     0     0  21652      0  31276  24576      0  32767
 131072  65534     0     0  17258      0  58112  49152      0  65535
 262144 131072     0     0  35660      0 123561  98304      0 131071
 524288 262142     0     0  31302      0 212057 196608      0 262143
1048576 524286     0     0 312438      0 484942 393216      0 524287

  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
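A Python sketch of the run-detection logic (the real count_run() is C; the
function shape here is just illustrative):

```python
def count_run(a, lo):
    # Return (length, descending) for the run starting at a[lo].
    # Descending runs are strictly decreasing, so reversing them in
    # place cannot reorder equal elements -- stability is preserved.
    n = len(a)
    if lo == n - 1:
        return 1, False
    i = lo + 1
    if a[i] < a[lo]:                        # strictly descending
        while i + 1 < n and a[i + 1] < a[i]:
            i += 1
        return i - lo + 1, True
    else:                                   # non-decreasing (ascending)
        while i + 1 < n and a[i + 1] >= a[i]:
            i += 1
        return i - lo + 1, False
```

Note how a tie (a[i] == a[lo]) falls into the ascending branch, exactly as
the strict definition of descending requires.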

If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
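The "boost to minrun" step can be sketched with the stdlib bisect module;
the function name is invented here, and bisect_right is what keeps the
insertion stable:

```python
from bisect import bisect_right

def boost_run(a, lo, hi, start):
    # a[lo:start] is already a sorted run; stably extend it to cover
    # a[lo:hi] via binary insertion sort.  bisect_right places each new
    # element after any equal elements already in the run.
    for i in range(start, hi):
        v = a[i]
        pos = bisect_right(a, v, lo, i)
        a[pos + 1:i + 1] = a[pos:i]   # shift the tail right one slot
        a[pos] = v
```

For example, a natural run of length 3 at the front of a slice of 6 gets
extended one element at a time until the whole slice is sorted.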


Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and -- mergesort's
bugaboo relative to samplesort -- a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
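A direct Python transcription of that rule:

```python
def merge_compute_minrun(n):
    # Take the first 6 bits of n, and add 1 if any remaining bits are
    # set.  Returns n itself for n < 64, else a value in range(32, 65).
    r = 0                  # becomes 1 if any bit below the top 6 is set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

merge_compute_minrun(2112) is 33, giving the 64 perfectly mergeable runs of
length 33 described above, and any exact power of 2 at least 64 maps to 32.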


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its length is passed to found_new_run() to
potentially merge runs on a stack of pending runs.  We would like to delay
merging as long as possible in order to exploit patterns that may come up
later, but we like even more to do merging as soon as possible to exploit
that the run just found is still high in the memory hierarchy.  We also can't
delay merging "too long" because it consumes memory to remember the runs that
are still unmerged, and the stack has a fixed size.

The original version of this code used the first thing I made up that didn't
obviously suck ;-) It was loosely based on invariants involving the Fibonacci
sequence.

It worked OK, but it was hard to reason about, and was subtle enough that the
intended invariants weren't actually preserved.  Researchers discovered that
when trying to complete a computer-generated correctness proof.  That was
easily-enough repaired, but the discovery spurred quite a bit of academic
interest in truly good ways to manage incremental merging on the fly.

At least a dozen different approaches were developed, some provably having
near-optimal worst case behavior with respect to the entropy of the
distribution of run lengths.  Some details can be found in bpo-34561.

The code now uses the "powersort" merge strategy from:

    "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
     That Optimally Adapt to Existing Runs"
    J. Ian Munro and Sebastian Wild

The code is pretty simple, but the justification is quite involved, as it's
based on fast approximations to optimal binary search trees, which are
substantial topics on their own.

Here we'll just cover some pragmatic details:

The `powerloop()` function computes a run's "power". Say two adjacent runs
begin at index s1. The first run has length n1, and the second run (starting
at index s1+n1, called "s2" below) has length n2. The list has total length n.
The "power" of the first run is a small integer, the depth of the node
connecting the two runs in an ideal binary merge tree, where power 1 is the
root node, and the power increases by 1 for each level deeper in the tree.

The power is the least integer L such that the "midpoint interval" contains
a rational number of the form J/2**L. The midpoint interval is the
semi-closed interval:

    ((s1 + n1/2)/n, (s2 + n2/2)/n]

Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
the first position at which the 2**-L bit differs between the expansions.
Since the left end of the interval is less than the right end, the first
differing bit must be a 0 bit in the left quotient and a 1 bit in the right
quotient.

`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
subtractions, and shifts in a loop.
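A Python sketch of that bit-at-a-time loop (Python ints can't overflow, so
the integer-width concerns below don't apply to this transcription):

```python
def powerloop(s1, n1, n2, n):
    # The two midpoints are a/(2*n) and b/(2*n).  Extract their binary
    # expansions one bit at a time until the bits differ; the position
    # of the first differing bit is the power.
    a = 2 * s1 + n1        # 2*n times the left midpoint
    b = a + n1 + n2        # 2*n times the right midpoint
    result = 0
    while True:
        result += 1
        if a >= n:         # both quotient bits are 1
            a -= n
            b -= n
        elif b >= n:       # left bit 0, right bit 1: first difference
            break
        # else both quotient bits are 0
        a *= 2
        b *= 2
    return result
```

For instance, two equal-length runs that split the list in half meet at the
root of the merge tree, so their power is 1.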

You'll notice the paper uses an O(1) method instead, but that relies on two
things we don't have:

- An O(1) "count leading zeroes" primitive. We can find such a thing as a C
  extension on most platforms, but not all, and there's no uniform spelling
  on the platforms that support it.

- Integer division on an integer type twice as wide as needed to hold the
  list length. But the latter is Py_ssize_t for us, and is typically the
  widest native signed integer type the platform supports.

But since runs in our algorithm are almost never very short, the once-per-run
overhead of `powerloop()` seems lost in the noise.

Detail: why is Py_ssize_t "wide enough" in `powerloop()`?  We do, after all,
shift integers of that width left by 1.  How do we know that won't spill into
the sign bit?  The trick is that we have some slop. `n` (the total list
length) is the number of list elements, which is at most 4 times (on a 32-bit
box, with 4-byte pointers) smaller than the largest size_t.  So at least the
leading two bits of the integers we're using are clear.

Since we can't compute a run's power before seeing the run that follows it,
the most-recently identified run is never merged by `found_new_run()`.
Instead a new run is only used to compute the 2nd-most-recent run's power.
Then adjacent runs are merged so long as their saved power (tree depth) is
greater than that newly computed power. When found_new_run() returns, only
then is a new run pushed on to the stack of pending runs.

A key invariant is that powers on the run stack are strictly decreasing
(starting from the run at the top of the stack).

Note that even powersort's strategy isn't always truly optimal. It can't be.
Computing an optimal merge sequence can be done in time quadratic in the
number of runs, which is very much slower, and also requires finding &
remembering _all_ the runs' lengths (of which there may be billions) in
advance.  It's remarkable, though, how close to optimal this strategy gets.

Curious factoid: of all the alternatives I've seen in the literature,
powersort's is the only one that's always truly optimal for a collection of 3
run lengths (for three lengths A B C, it's always optimal to first merge the
shorter of A and C with B).


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult.  Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use.  But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
4387db96d56Sopenharmony_ci
4397db96d56Sopenharmony_ciIf B is smaller (function merge_hi, which is merge_lo's "mirror image"),
4407db96d56Sopenharmony_cimuch the same, except that we need to merge right to left, copying B into a
4417db96d56Sopenharmony_citemp array and starting the stores at the right end of where B used to live.
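A pure-Python sketch of the merge_lo scheme (my own simplified version,
operating on a list of comparables rather than the C pointer arrays):

```python
# Sketch of merge_lo's memory scheme (illustrative, not the C code): merge
# adjacent sorted runs a[lo:mid] (run A) and a[mid:hi] (run B) in place,
# using temp storage only for A, the smaller run.
def merge_lo_sketch(a, lo, mid, hi):
    tmp = a[lo:mid]              # copy A to temp; B stays where it is
    i, j, dest = 0, mid, lo      # invariant: free slots before j == len(tmp) - i
    while i < len(tmp) and j < hi:
        if a[j] < tmp[i]:        # strict < keeps the sort stable: ties favor A
            a[dest] = a[j]
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1
    # Shared exit code: whether B was exhausted (or, in the C version, a
    # comparison raised), copy A's leftovers back from the temp area.
    a[dest:dest + len(tmp) - i] = tmp[i:]

xs = [1, 3, 5, 2, 4, 6]
merge_lo_sketch(xs, 0, 3, 6)
assert xs == [1, 2, 3, 4, 5, 6]
```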

A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win big in time,
copying, and memory savings alike when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
they're cheap in the cases where they don't pay.  It's generally true in
this algorithm that we're willing to gamble a little to win a lot, even
though the net expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi with runs where A > B.  They
don't know whether the data is clustered or uniform, but a lovely thing
about merging is that many kinds of clustering "reveal themselves" by how
many times in a row the winning merge element comes from the same run.
We'll only discuss merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.  We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to
one-pair-at-a-time mode.

A refinement:  The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
copy all the B's up to that point in one chunk, and then copy A[0].  Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on:  a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such runs very efficiently when they do exist.
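The gallop-then-binary-search combination can be sketched in Python (my own
simplified version with the hint fixed at the left end; the real
gallop_left/gallop_right also take a hint argument and handle the tie rules
described in note LEFT OR RIGHT):

```python
import bisect

# Simplified illustration (not CPython's gallop_left): find the leftmost
# index i in sorted list b with x <= b[i], i.e. where x belongs.
def gallop_sketch(x, b):
    if not b or x <= b[0]:
        return 0
    # Exponential phase: probe b[1], b[3], b[7], ..., b[2**j - 1] until
    # x <= b[ofs] or we run off the end.  At most ~lg(len(b)) compares.
    ofs = 1
    while ofs < len(b) and b[ofs] < x:
        ofs = 2 * ofs + 1
    last = (ofs - 1) // 2            # previous probe: b[last] < x is known
    # Binary phase over the region of uncertainty between the two probes.
    return bisect.bisect_left(b, x, last + 1, min(ofs, len(b)))

b = [1, 3, 5, 7, 9, 11, 13]
assert gallop_sketch(0, b) == 0      # early exit after a single compare
assert gallop_sketch(6, b) == 3
assert gallop_sketch(20, b) == len(b)
```

Note how a value that belongs near the front is found after only a compare
or two, while the worst case stays around 2*lg(B) compares.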

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).  An earlier paper called the same strategy
"exponential search":

   "Optimistic Sorting and Information Theoretic Complexity"
   Peter McIlroy
   SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
   467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao.  The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop?  Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story.  Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear
   one-at-a-time search, depending on the data.

#2 requires details.  If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function.  If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search.  On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss!  Really.  It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
                                        ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
at the other values.  At and after i=6, galloping always wins.
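The table and the formula can be checked mechanically (helper names are
mine):

```python
# Check of the compare counts above (my own helpers, not CPython code).
def linear_compares(i):
    return i + 1                     # one compare per position tried

def gallop_compares(i):
    if i == 0:
        return 1                     # single compare against B[0]
    # 2*floor(lg(i)) + 2 compares total; floor(lg(i)) == i.bit_length() - 1,
    # so the total simplifies to 2 * i.bit_length().
    return 2 * i.bit_length()

assert [gallop_compares(i) for i in range(12)] == \
       [1, 2, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8]
# Galloping loses at i=2 and i=4, ties at the other values below 6,
# and wins from i=6 on.
assert [i for i in range(1, 6) if gallop_compares(i) > linear_compares(i)] == [2, 4]
assert all(gallop_compares(i) < linear_compares(i) for i in range(6, 1000))
```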

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
is 7, and that's pretty strong evidence.  However, if the data is random, it
will trigger galloping mode purely by luck every now and again, and it's
quite likely to hit one of the losing cases next.  On the other hand, in
cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then.  So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge).  But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges).  For
random data, this all but eliminates the gallop penalty:  min_gallop grows
large enough that we almost never get into galloping mode.  And for cases
like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
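The feedback rule can be modeled in a few lines.  This is a deliberately
simplified toy of my own, not the real merge_lo/merge_hi bookkeeping, but it
shows the direction of each adjustment:

```python
MIN_GALLOP = 7   # initial threshold, as in the real code

# Toy model of the adjustment only (assumed simplification): after each
# galloped slice, reward a long winning slice by lowering min_gallop,
# penalize a short one by raising it.
def adjust_min_gallop(min_gallop, slice_len):
    if slice_len >= MIN_GALLOP:          # galloping paid off
        return max(1, min_gallop - 1)    # easier to (re)enter galloping mode
    return min_gallop + 1                # harder to (re)enter galloping mode

mg = MIN_GALLOP
for n in [50, 60, 70]:                   # ~sort-like data: long winning slices
    mg = adjust_min_gallop(mg, n)
assert mg == 4                           # galloping becomes easier to trigger
for n in [1, 2, 1]:                      # random-like data: gallop keeps losing
    mg = adjust_min_gallop(mg, n)
assert mg == 7                           # drifts back toward never galloping
```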


Galloping Complication
----------------------
The description above was for merge_lo.  merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first.  Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways). For
this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
functions have a "hint" argument, which is the index at which galloping
should begin.  So galloping can actually start at any index, and proceed at
offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.

In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run).  It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999.  But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.

~sort is a good example of when balanced runs could benefit from a better
hint value:  to the extent possible, this would like to use a starting
offset equal to the previous value of acount/bcount.  Doing so saves about
10% of the compares in ~sort.  However, doing so is also a mixed bag,
hurting other cases.


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE:  This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in range(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s  %9.2f  %11d"
f2     = "%5s  %9.2f  %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort",  sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running.  time.clock() is wall-clock time on Windows, with better than
microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range.  Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort    6110.80   1937887573
msort    6002.78   1909389381
            1.80         1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately.  The consequence is that it ends up using two compares to sort
[2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.


NOTES
-----

BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time.  Most texts don't mention this
variation, and those that do usually say it's not worth the bother:  insertion
sort remains quadratic (expected and worst cases) either way.  Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter.  Moving objects is just copying
pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime:  what's the type of the left comparand?  the
type of the right?  do they need to be coerced to a common type?  where's the
code to compare these types?  And so on.  Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.


LEFT OR RIGHT
gallop_left() and gallop_right() are akin to the Python bisect module's
bisect_left() and bisect_right():  they're the same unless the slice they're
searching contains a (at least one) value equal to the value being searched
for.  In that case, gallop_left() returns the position immediately before the
leftmost equal value, and gallop_right() the position immediately after the
rightmost equal value.  The distinction is needed to preserve stability.  In
general, when merging adjacent runs A and B, gallop_left is used to search
thru B for where an element from A belongs, and gallop_right to search thru A
for where an element from B belongs.
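For concreteness, the bisect-module analogue of that distinction:

```python
from bisect import bisect_left, bisect_right

# With duplicates present, the two searches bracket the run of equal values.
b = [1, 2, 2, 2, 3]
assert bisect_left(b, 2) == 1    # insertion point before the leftmost 2
assert bisect_right(b, 2) == 4   # insertion point after the rightmost 2
# No equal value present: both give the same answer.
assert bisect_left(b, 5) == bisect_right(b, 5) == 5
```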


REGION OF UNCERTAINTY
Two kinds of confusion seem to be common about the claim that after finding
a k such that

    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]

then a binary search requires exactly k-1 tries to find A[0]'s proper
location. For concreteness, say k=3, so B[3] < A[0] <= B[7].

The first confusion takes the form "OK, then the region of uncertainty is at
indices 3, 4, 5, 6 and 7:  that's 5 elements, not the claimed 2**(k-1) - 1 =
3"; or the region is viewed as a Python slice and the objection is "but that's
the slice B[3:7], so has 7-3 = 4 elements".  Resolution:  we've already
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
already known wrt _both_ endpoints.  What remains is to find A[0]'s correct
location wrt B[4], B[5] and B[6], which spans 3 elements.  Or in general, the
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
    (2**k-1)-1 - 2**(k-1) + 1 =
    2**k-1 - 2**(k-1) =
    2*2**(k-1)-1 - 2**(k-1) =
    (2-1)*2**(k-1) - 1 =
    2**(k-1) - 1
elements.

The second confusion:  "k-1 = 2 binary searches can find the correct location
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements:  we
could make this more efficient by arranging for the region of uncertainty to
span 2**(k-1) elements."  Resolution:  that confuses "elements" with
"locations".  In a slice with N elements, there are N+1 _locations_.  In the
example, with the region of uncertainty B[4], B[5], B[6], there are 4
locations:  before B[4], between B[4] and B[5], between B[5] and B[6], and
after B[6].  In general, across 2**(k-1)-1 elements, there are 2**(k-1)
locations.  That's why k-1 binary searches are necessary and sufficient.

OPTIMIZATION OF INDIVIDUAL COMPARISONS
As noted above, even the simplest Python comparison triggers a large pile of
C-level pointer dereferences, conditionals, and function calls.  This can be
partially mitigated by pre-scanning the data to determine whether the data is
homogeneous with respect to type.  If so, it is sometimes possible to
substitute faster type-specific comparisons for the slower, generic
PyObject_RichCompareBool.