Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also
  expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one fewer compare in \sort.

  Now for the more interesting cases.  Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations).  When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.
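  For a concrete check, lg(n!) can be computed without forming the gigantic
  integer n!; a small Python sketch (the helper name is made up here):

```python
import math

def lg_factorial(n):
    # lg(n!) = log2(gamma(n + 1)); lgamma avoids building the huge
    # integer n! and works fine for n in the millions.
    return math.lgamma(n + 1) / math.log(2)
```

  For n = 32768 this gives about 444255, matching the lg(n!) column in the
  table.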


      n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
-------  -------   ------   -------  -------  ------  -------  --------
  32768   444255   453096   453614    32908   452871   130491    469141 old
                   448885    33016    33007    50426   182083     65534 new
                    0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new

  65536   954037   972699   981940    65686   973104   260029   1004607
                   962991    65821    65808   101667   364341    131070
                    1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%

 131072  2039137  2101881  2091491   131232  2092894   554790   2161379
                  2057533   131410   131361   206193   728871    262142
                    2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%

 262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
                  4377402   262437   262459   416347  1457945    524286
                    1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%

 524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
                  9278734   524580   524633   837947  2916107   1048574
                   1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%

1048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
                 19606028  1048958  1048941  1694896  5832445   2097150
                    1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%

  Discussion of cases:

  *sort:  There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort:  No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).
  +sort:  samplesort special-cases this data, and does a few fewer compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort:  samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

   2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
  32768  16384     0     0   6256      0  10821  12288      0  16383
  65536  32766     0     0  21652      0  31276  24576      0  32767
 131072  65534     0     0  17258      0  58112  49152      0  65535
 262144 131072     0     0  35660      0 123561  98304      0 131071
 524288 262142     0     0  31302      0 212057 196608      0 262143
1048576 524286     0     0 312438      0 484942 393216      0 524287

  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
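A Python sketch of the run-detection logic (the real count_run() is C; the
function shape here is just illustrative):

```python
def count_run(a, lo):
    # Return (length, descending) for the run starting at a[lo].
    # Descending runs are strictly decreasing, so reversing them in
    # place cannot reorder equal elements -- stability is preserved.
    n = len(a)
    if lo == n - 1:
        return 1, False
    i = lo + 1
    if a[i] < a[lo]:                        # strictly descending
        while i + 1 < n and a[i + 1] < a[i]:
            i += 1
        return i - lo + 1, True
    else:                                   # non-decreasing (ascending)
        while i + 1 < n and a[i + 1] >= a[i]:
            i += 1
        return i - lo + 1, False
```

Note how a tie (a[i] == a[lo]) falls into the ascending branch, exactly as
the strict definition of descending requires.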

If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
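The "boost to minrun" step can be sketched with the stdlib bisect module;
the function name is invented here, and bisect_right is what keeps the
insertion stable:

```python
from bisect import bisect_right

def boost_run(a, lo, hi, start):
    # a[lo:start] is already a sorted run; stably extend it to cover
    # a[lo:hi] via binary insertion sort.  bisect_right places each new
    # element after any equal elements already in the run.
    for i in range(start, hi):
        v = a[i]
        pos = bisect_right(a, v, lo, i)
        a[pos + 1:i + 1] = a[pos:i]   # shift the tail right one slot
        a[pos] = v
```

For example, a natural run of length 3 at the front of a slice of 6 gets
extended one element at a time until the whole slice is sorted.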


Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and -- mergesort's
bugaboo relative to samplesort -- a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
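A direct Python transcription of that rule:

```python
def merge_compute_minrun(n):
    # Take the first 6 bits of n, and add 1 if any remaining bits are
    # set.  Returns n itself for n < 64, else a value in range(32, 65).
    r = 0                  # becomes 1 if any bit below the top 6 is set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

merge_compute_minrun(2112) is 33, giving the 64 perfectly mergeable runs of
length 33 described above, and any exact power of 2 at least 64 maps to 32.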


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its length is passed to found_new_run() to
potentially merge runs on a stack of pending runs.  We would like to delay
merging as long as possible in order to exploit patterns that may come up
later, but we like even more to do merging as soon as possible to exploit
that the run just found is still high in the memory hierarchy.  We also can't
delay merging "too long" because it consumes memory to remember the runs that
are still unmerged, and the stack has a fixed size.

The original version of this code used the first thing I made up that didn't
obviously suck ;-) It was loosely based on invariants involving the Fibonacci
sequence.

It worked OK, but it was hard to reason about, and was subtle enough that the
intended invariants weren't actually preserved.  Researchers discovered that
when trying to complete a computer-generated correctness proof.  That was
easily-enough repaired, but the discovery spurred quite a bit of academic
interest in truly good ways to manage incremental merging on the fly.

At least a dozen different approaches were developed, some provably having
near-optimal worst case behavior with respect to the entropy of the
distribution of run lengths.  Some details can be found in bpo-34561.

The code now uses the "powersort" merge strategy from:

    "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
     That Optimally Adapt to Existing Runs"
    J. Ian Munro and Sebastian Wild

The code is pretty simple, but the justification is quite involved, as it's
based on fast approximations to optimal binary search trees, which are
substantial topics on their own.

Here we'll just cover some pragmatic details:

The `powerloop()` function computes a run's "power". Say two adjacent runs
begin at index s1. The first run has length n1, and the second run (starting
at index s1+n1, called "s2" below) has length n2. The list has total length n.
The "power" of the first run is a small integer, the depth of the node
connecting the two runs in an ideal binary merge tree, where power 1 is the
root node, and the power increases by 1 for each level deeper in the tree.

The power is the least integer L such that the "midpoint interval" contains
a rational number of the form J/2**L. The midpoint interval is the
semi-closed interval:

    ((s1 + n1/2)/n, (s2 + n2/2)/n]

Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
the first position at which the 2**-L bit differs between the expansions.
Since the left end of the interval is less than the right end, the first
differing bit must be a 0 bit in the left quotient and a 1 bit in the right
quotient.

`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
subtractions, and shifts in a loop.
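A Python sketch of that bit-at-a-time loop (Python ints can't overflow, so
the integer-width concerns below don't apply to this transcription):

```python
def powerloop(s1, n1, n2, n):
    # The two midpoints are a/(2*n) and b/(2*n).  Extract their binary
    # expansions one bit at a time until the bits differ; the position
    # of the first differing bit is the power.
    a = 2 * s1 + n1        # 2*n times the left midpoint
    b = a + n1 + n2        # 2*n times the right midpoint
    result = 0
    while True:
        result += 1
        if a >= n:         # both quotient bits are 1
            a -= n
            b -= n
        elif b >= n:       # left bit 0, right bit 1: first difference
            break
        # else both quotient bits are 0
        a *= 2
        b *= 2
    return result
```

For instance, two equal-length runs that split the list in half meet at the
root of the merge tree, so their power is 1.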

You'll notice the paper uses an O(1) method instead, but that relies on two
things we don't have:

- An O(1) "count leading zeroes" primitive. We can find such a thing as a C
  extension on most platforms, but not all, and there's no uniform spelling
  on the platforms that support it.

- Integer division on an integer type twice as wide as needed to hold the
  list length. But the latter is Py_ssize_t for us, and is typically the
  widest native signed integer type the platform supports.

But since runs in our algorithm are almost never very short, the once-per-run
overhead of `powerloop()` seems lost in the noise.

Detail: why is Py_ssize_t "wide enough" in `powerloop()`?  We do, after all,
shift integers of that width left by 1.  How do we know that won't spill into
the sign bit?  The trick is that we have some slop. `n` (the total list
length) is the number of list elements, which is at most 4 times (on a 32-bit
box, with 4-byte pointers) smaller than the largest size_t.  So at least the
leading two bits of the integers we're using are clear.

Since we can't compute a run's power before seeing the run that follows it,
the most-recently identified run is never merged by `found_new_run()`.
Instead a new run is only used to compute the 2nd-most-recent run's power.
Then adjacent runs are merged so long as their saved power (tree depth) is
greater than that newly computed power. When found_new_run() returns, only
then is a new run pushed on to the stack of pending runs.

A key invariant is that powers on the run stack are strictly decreasing
(starting from the run at the top of the stack).

Note that even powersort's strategy isn't always truly optimal. It can't be.
Computing an optimal merge sequence can be done in time quadratic in the
number of runs, which is very much slower, and also requires finding &
remembering _all_ the runs' lengths (of which there may be billions) in
advance.  It's remarkable, though, how close to optimal this strategy gets.

Curious factoid: of all the alternatives I've seen in the literature,
powersort's is the only one that's always truly optimal for a collection of 3
run lengths (for three lengths A B C, it's always optimal to first merge the
shorter of A and C with B).


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult.  Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use.  But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
4387db96d56Sopenharmony_ci
4397db96d56Sopenharmony_ciIf B is smaller (function merge_hi, which is merge_lo's "mirror image"),
4407db96d56Sopenharmony_cimuch the same, except that we need to merge right to left, copying B into a
4417db96d56Sopenharmony_citemp array and starting the stores at the right end of where B used to live.
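A pure-Python sketch of the merge_lo scheme (my own simplified version,
operating on a list of comparables rather than the C pointer arrays):

```python
# Sketch of merge_lo's memory scheme (illustrative, not the C code): merge
# adjacent sorted runs a[lo:mid] (run A) and a[mid:hi] (run B) in place,
# using temp storage only for A, the smaller run.
def merge_lo_sketch(a, lo, mid, hi):
    tmp = a[lo:mid]              # copy A to temp; B stays where it is
    i, j, dest = 0, mid, lo      # invariant: free slots before j == len(tmp) - i
    while i < len(tmp) and j < hi:
        if a[j] < tmp[i]:        # strict < keeps the sort stable: ties favor A
            a[dest] = a[j]
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1
    # Shared exit code: whether B was exhausted (or, in the C version, a
    # comparison raised), copy A's leftovers back from the temp area.
    a[dest:dest + len(tmp) - i] = tmp[i:]

xs = [1, 3, 5, 2, 4, 6]
merge_lo_sketch(xs, 0, 3, 6)
assert xs == [1, 2, 3, 4, 5, 6]
```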

A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win big in time,
copying, and memory savings alike when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
they're cheap in the cases where they don't pay.  It's generally true in
this algorithm that we're willing to gamble a little to win a lot, even
though the net expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi with runs where A > B.  They
don't know whether the data is clustered or uniform, but a lovely thing
about merging is that many kinds of clustering "reveal themselves" by how
many times in a row the winning merge element comes from the same run.
We'll only discuss merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.  We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to
one-pair-at-a-time mode.

A refinement:  The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
copy all the B's up to that point in one chunk, and then copy A[0].  Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on:  a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such runs very efficiently when they do exist.
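The gallop-then-binary-search combination can be sketched in Python (my own
simplified version with the hint fixed at the left end; the real
gallop_left/gallop_right also take a hint argument and handle the tie rules
described in note LEFT OR RIGHT):

```python
import bisect

# Simplified illustration (not CPython's gallop_left): find the leftmost
# index i in sorted list b with x <= b[i], i.e. where x belongs.
def gallop_sketch(x, b):
    if not b or x <= b[0]:
        return 0
    # Exponential phase: probe b[1], b[3], b[7], ..., b[2**j - 1] until
    # x <= b[ofs] or we run off the end.  At most ~lg(len(b)) compares.
    ofs = 1
    while ofs < len(b) and b[ofs] < x:
        ofs = 2 * ofs + 1
    last = (ofs - 1) // 2            # previous probe: b[last] < x is known
    # Binary phase over the region of uncertainty between the two probes.
    return bisect.bisect_left(b, x, last + 1, min(ofs, len(b)))

b = [1, 3, 5, 7, 9, 11, 13]
assert gallop_sketch(0, b) == 0      # early exit after a single compare
assert gallop_sketch(6, b) == 3
assert gallop_sketch(20, b) == len(b)
```

Note how a value that belongs near the front is found after only a compare
or two, while the worst case stays around 2*lg(B) compares.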

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).  An earlier paper called the same strategy
"exponential search":

   "Optimistic Sorting and Information Theoretic Complexity"
   Peter McIlroy
   SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
   467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao.  The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop?  Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story.  Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear
   one-at-a-time search, depending on the data.

#2 requires details.  If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function.  If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search.  On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss!  Really.  It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
                                        ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
at the other values.  At and after i=6, galloping always wins.
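The table and the formula can be checked mechanically (helper names are
mine):

```python
# Check of the compare counts above (my own helpers, not CPython code).
def linear_compares(i):
    return i + 1                     # one compare per position tried

def gallop_compares(i):
    if i == 0:
        return 1                     # single compare against B[0]
    # 2*floor(lg(i)) + 2 compares total; floor(lg(i)) == i.bit_length() - 1,
    # so the total simplifies to 2 * i.bit_length().
    return 2 * i.bit_length()

assert [gallop_compares(i) for i in range(12)] == \
       [1, 2, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8]
# Galloping loses at i=2 and i=4, ties at the other values below 6,
# and wins from i=6 on.
assert [i for i in range(1, 6) if gallop_compares(i) > linear_compares(i)] == [2, 4]
assert all(gallop_compares(i) < linear_compares(i) for i in range(6, 1000))
```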

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
is 7, and that's pretty strong evidence.  However, if the data is random, it
will trigger galloping mode purely by luck every now and again, and it's
quite likely to hit one of the losing cases next.  On the other hand, in
cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then.  So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge).  But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges).  For
random data, this all but eliminates the gallop penalty:  min_gallop grows
large enough that we almost never get into galloping mode.  And for cases
like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
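The feedback rule can be modeled in a few lines.  This is a deliberately
simplified toy of my own, not the real merge_lo/merge_hi bookkeeping, but it
shows the direction of each adjustment:

```python
MIN_GALLOP = 7   # initial threshold, as in the real code

# Toy model of the adjustment only (assumed simplification): after each
# galloped slice, reward a long winning slice by lowering min_gallop,
# penalize a short one by raising it.
def adjust_min_gallop(min_gallop, slice_len):
    if slice_len >= MIN_GALLOP:          # galloping paid off
        return max(1, min_gallop - 1)    # easier to (re)enter galloping mode
    return min_gallop + 1                # harder to (re)enter galloping mode

mg = MIN_GALLOP
for n in [50, 60, 70]:                   # ~sort-like data: long winning slices
    mg = adjust_min_gallop(mg, n)
assert mg == 4                           # galloping becomes easier to trigger
for n in [1, 2, 1]:                      # random-like data: gallop keeps losing
    mg = adjust_min_gallop(mg, n)
assert mg == 7                           # drifts back toward never galloping
```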


Galloping Complication
----------------------
The description above was for merge_lo.  merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first.  Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways). For
this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
functions have a "hint" argument, which is the index at which galloping
should begin.  So galloping can actually start at any index, and proceed at
offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.

In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run).  It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999.  But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.

~sort is a good example of when balanced runs could benefit from a better
hint value:  to the extent possible, this would like to use a starting
offset equal to the previous value of acount/bcount.  Doing so saves about
10% of the compares in ~sort.  However, doing so is also a mixed bag,
hurting other cases.


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE:  This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in range(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s  %9.2f  %11d"
f2     = "%5s  %9.2f  %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort",  sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running.  time.clock() is wall-clock time on Windows, with better than
microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range.  Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort    6110.80   1937887573
msort    6002.78   1909389381
            1.80         1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately.  The consequence is that it ends up using two compares to sort
[2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.


NOTES
-----

BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time.  Most texts don't mention this
variation, and those that do usually say it's not worth the bother:  insertion
sort remains quadratic (expected and worst cases) either way.  Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter.  Moving objects is just copying
pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime:  what's the type of the left comparand?  the
type of the right?  do they need to be coerced to a common type?  where's the
code to compare these types?  And so on.  Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.


LEFT OR RIGHT
gallop_left() and gallop_right() are akin to the Python bisect module's
bisect_left() and bisect_right():  they're the same unless the slice they're
searching contains a (at least one) value equal to the value being searched
for.  In that case, gallop_left() returns the position immediately before the
leftmost equal value, and gallop_right() the position immediately after the
rightmost equal value.  The distinction is needed to preserve stability.  In
general, when merging adjacent runs A and B, gallop_left is used to search
thru B for where an element from A belongs, and gallop_right to search thru A
for where an element from B belongs.
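For concreteness, the bisect-module analogue of that distinction:

```python
from bisect import bisect_left, bisect_right

# With duplicates present, the two searches bracket the run of equal values.
b = [1, 2, 2, 2, 3]
assert bisect_left(b, 2) == 1    # insertion point before the leftmost 2
assert bisect_right(b, 2) == 4   # insertion point after the rightmost 2
# No equal value present: both give the same answer.
assert bisect_left(b, 5) == bisect_right(b, 5) == 5
```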


REGION OF UNCERTAINTY
Two kinds of confusion seem to be common about the claim that after finding
a k such that

    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]

then a binary search requires exactly k-1 tries to find A[0]'s proper
location. For concreteness, say k=3, so B[3] < A[0] <= B[7].

The first confusion takes the form "OK, then the region of uncertainty is at
indices 3, 4, 5, 6 and 7:  that's 5 elements, not the claimed 2**(k-1) - 1 =
3"; or the region is viewed as a Python slice and the objection is "but that's
the slice B[3:7], so has 7-3 = 4 elements".  Resolution:  we've already
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
already known wrt _both_ endpoints.  What remains is to find A[0]'s correct
location wrt B[4], B[5] and B[6], which spans 3 elements.  Or in general, the
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
    (2**k-1)-1 - 2**(k-1) + 1 =
    2**k-1 - 2**(k-1) =
    2*2**(k-1)-1 - 2**(k-1) =
    (2-1)*2**(k-1) - 1 =
    2**(k-1) - 1
elements.

The second confusion:  "k-1 = 2 binary searches can find the correct location
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements:  we
could make this more efficient by arranging for the region of uncertainty to
span 2**(k-1) elements."  Resolution:  that confuses "elements" with
"locations".  In a slice with N elements, there are N+1 _locations_.  In the
example, with the region of uncertainty B[4], B[5], B[6], there are 4
locations:  before B[4], between B[4] and B[5], between B[5] and B[6], and
after B[6].  In general, across 2**(k-1)-1 elements, there are 2**(k-1)
locations.  That's why k-1 binary searches are necessary and sufficient.

OPTIMIZATION OF INDIVIDUAL COMPARISONS
As noted above, even the simplest Python comparison triggers a large pile of
C-level pointer dereferences, conditionals, and function calls.  This can be
partially mitigated by pre-scanning the data to determine whether the data is
homogeneous with respect to type.  If so, it is sometimes possible to
substitute faster type-specific comparisons for the slower, generic
PyObject_RichCompareBool.