Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also expected-
  case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers.
  (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.
  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.
  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases.  Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!)
  is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations).  When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.

      n   lg(n!)    *sort    3sort    +sort    %sort    ~sort    !sort
-------  -------  -------  -------  -------  -------  -------  -------
  32768   444255   453096   453614    32908   452871   130491   469141  old
                   448885    33016    33007    50426   182083    65534  new
                    0.94% 1273.92%   -0.30%  798.09%  -28.33%  615.87%  %ch from new

  65536   954037   972699   981940    65686   973104   260029  1004607
                   962991    65821    65808   101667   364341   131070
                    1.01% 1391.83%   -0.19%  857.15%  -28.63%  666.47%

 131072  2039137  2101881  2091491   131232  2092894   554790  2161379
                  2057533   131410   131361   206193   728871   262142
                    2.16% 1491.58%   -0.10%  915.02%  -23.88%  724.51%

 262144  4340409  4464460  4403233   262314  4445884  1107842  4584560
                  4377402   262437   262459   416347  1457945   524286
                    1.99% 1577.82%   -0.06%  967.83%  -24.01%  774.44%

 524288  9205096  9453356  9408463   524468  9441930  2218577  9692015
                  9278734   524580   524633   837947  2916107  1048574
                    1.88% 1693.52%   -0.03% 1026.79%  -23.92%  824.30%

1048576 19458756 19950272 19838588  1048766 19912134  4430649 20434212
                 19606028  1048958  1048941  1694896  5832445  2097150
                    1.76% 1791.27%   -0.02% 1074.83%  -24.03%  874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort: samplesort special-cases this data, and does a few fewer compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.
  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.
+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

   2**i  *sort  \sort  /sort  3sort  +sort  %sort  ~sort  =sort  !sort
  32768  16384      0      0   6256      0  10821  12288      0  16383
  65536  32766      0      0  21652      0  31276  24576      0  32767
 131072  65534      0      0  17258      0  58112  49152      0  65535
 262144 131072      0      0  35660      0 123561  98304      0 131071
 524288 262142      0      0  31302      0 212057 196608      0 262143
1048576 524286      0      0 312438      0 484942 393216      0 524287

  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").
  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.

If an array is random, it's very unlikely we'll see long runs.
If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.


Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.
At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and -- mergesort's
bugaboo relative to samplesort -- a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!
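The rule the next few paragraphs arrive at -- take the first 6 bits of N,
and add 1 if any of the remaining bits are set -- is easy to sketch in
Python (a sketch of what merge_compute_minrun() does, not the C code
itself):

```python
def merge_compute_minrun(n):
    """Sketch of the minrun rule: the first 6 bits of n, plus 1 if any
    of the remaining bits are set.  For n < 64 the loop never runs and
    the result is n itself, so binary insertion sort handles the whole
    array."""
    r = 0                  # becomes 1 if any bit shifted off is set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

# The troublesome N=2112 now yields 64 runs of length 33 -- every merge
# perfectly balanced:
assert merge_compute_minrun(2112) == 33
assert divmod(2112, 33) == (64, 0)
```

Note that for N an exact power of 2 the rule returns 32, so the run count
N/32 is again a power of 2, which is what balanced merging wants.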
What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.
For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its length is passed to found_new_run() to
potentially merge runs on a stack of pending runs.  We would like to delay
merging as long as possible in order to exploit patterns that may come up
later, but we like even more to do merging as soon as possible to exploit
that the run just found is still high in the memory hierarchy.  We also can't
delay merging "too long" because it consumes memory to remember the runs that
are still unmerged, and the stack has a fixed size.

The original version of this code used the first thing I made up that didn't
obviously suck ;-)  It was loosely based on invariants involving the
Fibonacci sequence.

It worked OK, but it was hard to reason about, and was subtle enough that the
intended invariants weren't actually preserved.  Researchers discovered that
when trying to complete a computer-generated correctness proof.
That was
easily-enough repaired, but the discovery spurred quite a bit of academic
interest in truly good ways to manage incremental merging on the fly.

At least a dozen different approaches were developed, some provably having
near-optimal worst case behavior with respect to the entropy of the
distribution of run lengths.  Some details can be found in bpo-34561.

The code now uses the "powersort" merge strategy from:

    "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
    That Optimally Adapt to Existing Runs"
    J. Ian Munro and Sebastian Wild

The code is pretty simple, but the justification is quite involved, as it's
based on fast approximations to optimal binary search trees, which are
substantial topics on their own.

Here we'll just cover some pragmatic details:

The `powerloop()` function computes a run's "power".  Say two adjacent runs
begin at index s1.  The first run has length n1, and the second run (starting
at index s1+n1, called "s2" below) has length n2.  The list has total length
n.  The "power" of the first run is a small integer, the depth of the node
connecting the two runs in an ideal binary merge tree, where power 1 is the
root node, and the power increases by 1 for each level deeper in the tree.
The power is the least integer L such that the "midpoint interval" contains
a rational number of the form J/2**L.  The midpoint interval is the semi-
closed interval:

    ((s1 + n1/2)/n, (s2 + n2/2)/n]

Yes, that's brain-busting at first ;-)  Concretely, if (s1 + n1/2)/n and
(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
the first position at which the 2**-L bit differs between the expansions.
Since the left end of the interval is less than the right end, the first
differing bit must be a 0 bit in the left quotient and a 1 bit in the right
quotient.

`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
subtractions, and shifts in a loop.

You'll notice the paper uses an O(1) method instead, but that relies on two
things we don't have:

- An O(1) "count leading zeroes" primitive.  We can find such a thing as a C
  extension on most platforms, but not all, and there's no uniform spelling
  on the platforms that support it.

- Integer division on an integer type twice as wide as needed to hold the
  list length.  But the latter is Py_ssize_t for us, and is typically the
  widest native signed integer type the platform supports.
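The bit-at-a-time emulation is short.  Here's a Python sketch of the logic
in the C `powerloop()` (Python's unbounded ints sidestep the width worries
discussed below, so this shows only the shape of the computation):

```python
def powerloop(s1, n1, n2, n):
    """Power of the node joining the run of length n1 starting at index
    s1 with the adjacent run of length n2, in a list of total length n.
    Emulates dividing the two scaled midpoints by n, one bit at a time,
    until the quotients' binary expansions differ."""
    assert s1 >= 0 and n1 >= 1 and n2 >= 1 and s1 + n1 + n2 <= n
    a = 2 * s1 + n1      # 2*n times the left midpoint  (s1 + n1/2)/n
    b = a + n1 + n2      # 2*n times the right midpoint (s2 + n2/2)/n
    result = 0
    while True:
        result += 1
        if a >= n:       # both quotient bits are 1: subtract, go deeper
            a -= n
            b -= n
        elif b >= n:     # left bit is 0, right bit is 1: they differ here
            break
        # else both bits are 0: nothing to subtract
        assert a < b < n
        a <<= 1          # shift in the next quotient bit
        b <<= 1
    return result

# A run meeting its equal-length neighbor at the list's midpoint sits at
# the root of the merge tree:
assert powerloop(0, 4, 4, 8) == 1
```

The loop must terminate: b - a = n1 + n2 > 0 is preserved by the
subtractions and doubled by the shifts, so the gap eventually spans n,
forcing the quotient bits to differ.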
But since runs in our algorithm are almost never very short, the once-per-run
overhead of `powerloop()` seems lost in the noise.

Detail: why is Py_ssize_t "wide enough" in `powerloop()`?  We do, after all,
shift integers of that width left by 1.  How do we know that won't spill into
the sign bit?  The trick is that we have some slop.  `n` (the total list
length) is the number of list elements, which is at most 4 times (on a 32-bit
box, with 4-byte pointers) smaller than the largest size_t.  So at least the
leading two bits of the integers we're using are clear.

Since we can't compute a run's power before seeing the run that follows it,
the most-recently identified run is never merged by `found_new_run()`.
Instead a new run is only used to compute the 2nd-most-recent run's power.
Then adjacent runs are merged so long as their saved power (tree depth) is
greater than that newly computed power.  When found_new_run() returns, only
then is a new run pushed on to the stack of pending runs.

A key invariant is that powers on the run stack are strictly decreasing
(starting from the run at the top of the stack).

Note that even powersort's strategy isn't always truly optimal.  It can't be.
Computing an optimal merge sequence can be done in time quadratic in the
number of runs, which is very much slower, and also requires finding &
remembering _all_ the runs' lengths (of which there may be billions) in
advance.  It's remarkable, though, how close to optimal this strategy gets.

Curious factoid: of all the alternatives I've seen in the literature,
powersort's is the only one that's always truly optimal for a collection of 3
run lengths (for three lengths A B C, it's always optimal to first merge the
shorter of A and C with B).


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult.  Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use.  But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).
The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run.
It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi where A > B.  They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run.  We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.
We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to
one-pair-at-a-time mode.

A refinement: The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
copy all the B's up to that point in one chunk, and then copy A[0].  Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.
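The gallop-then-binary-search scheme can be sketched in Python (a minimal sketch, not CPython's gallop_right(); the function name `gallop` and its bisect_left-style return convention are my own):

```python
from bisect import bisect_left

def gallop(key, b):
    """Return the number of leading elements of b that are < key,
    i.e. the index where key belongs (bisect_left semantics)."""
    if not b or not (b[0] < key):
        return 0
    # Gallop: probe offsets 1, 3, 7, 15, ... until key is bracketed,
    # i.e. b[prev] < key and (b[ofs] >= key or ofs is off the end).
    prev, ofs = 0, 1
    while ofs < len(b) and b[ofs] < key:
        prev, ofs = ofs, 2 * ofs + 1
    hi = min(ofs, len(b))
    # Region of uncertainty is b[prev+1 : hi]; binary-search it.
    return prev + 1 + bisect_left(b[prev + 1:hi], key)
```

The probe sequence doubles the bracketed region each step, so a key that belongs near the front of B is found in very few compares, while the worst case stays near 2*lg(B).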

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).
An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms),
    pp 467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao.  The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop?  Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story.  Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-time
   search, depending on the data.

#2 requires details.  If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function.  If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search.
On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss!  Really.  It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
                                  ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
at the other values.  At and after i=6, galloping always wins.
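The table and the closed forms behind it are easy to check mechanically (the function names here are my own, for illustration):

```python
import math

def linear_compares(i):
    # Linear search needs i+1 compares to learn A[0] belongs at B[i].
    return i + 1

def gallop_compares(i):
    # Gallop + binary search: 2*floor(lg(i)) + 2 compares for i >= 1,
    # and a single compare when A[0] belongs before B[0].
    if i == 0:
        return 1
    return 2 * int(math.log2(i)) + 2
```

Running both over small i reproduces the table exactly, including the losses at i=2 and i=4 and the first outright win at i=6.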

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
is 7, and that's pretty strong evidence.  However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next.  On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then.  So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge).  But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges).  For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode.  And for cases
like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.


Galloping Complication
----------------------
The description above was for merge_lo.
merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first.  Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).  For
this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
functions have a "hint" argument, which is the index at which galloping
should begin.  So galloping can actually start at any index, and proceed at
offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.

In the code as I type, it's always called with either 0 or n-1 (where n is
the # of elements in a run).  It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999.  But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.

~sort is a good example of when balanced runs could benefit from a better
hint value: to the extent possible, this would like to use a starting
offset equal to the previous value of acount/bcount.  Doing so saves about
10% of the compares in ~sort.  However, doing so is also a mixed bag,
hurting other cases.
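The hint mechanism can be sketched as follows (hypothetical names, not CPython's C functions; only the rightward gallop is shown, with a plain binary search standing in for the symmetric -1, -3, -7, ... leftward probes):

```python
from bisect import bisect_left

def gallop_from_hint(key, b, hint):
    """bisect_left-style position of key in sorted list b, starting the
    exponential probes at index `hint` and galloping to the right."""
    n = len(b)
    if hint < n and b[hint] < key:
        # Probe hint+1, hint+3, hint+7, ... until key is bracketed.
        last, ofs = 0, 1
        while hint + ofs < n and b[hint + ofs] < key:
            last, ofs = ofs, 2 * ofs + 1
        lo, hi = hint + last + 1, min(hint + ofs, n)
        return lo + bisect_left(b[lo:hi], key)
    # key belongs at or before the hint; a full version would gallop
    # left with offsets -1, -3, -7, ...; fall back to binary search.
    return bisect_left(b, key, 0, min(hint, n))
```

With a good hint (e.g. near where the previous gallop ended), the bracketing loop exits after very few probes, which is exactly the ~10% compare saving the ~sort experiment above observed.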


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in range(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s  %9.2f  %11d"
f2     = "%5s  %9.2f  %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort",  sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running.  time.clock() is wall-clock time on Windows, with better than
microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range.  Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80   1937887573
msort   6002.78   1909389381
          1.80         1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage.
This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately.  The consequence is that it ends up using two compares to sort
[2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.


NOTES
-----

BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time.  Most texts don't mention this
variation, and those that do usually say it's not worth the bother: insertion
sort remains quadratic (expected and worst cases) either way.  Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter.  Moving objects is just copying
pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime: what's the type of the left comparand?  the
type of the right?
do they need to be coerced to a common type?  where's the
code to compare these types?  And so on.  Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.


LEFT OR RIGHT
gallop_left() and gallop_right() are akin to the Python bisect module's
bisect_left() and bisect_right(): they're the same unless the slice they're
searching contains a (at least one) value equal to the value being searched
for.  In that case, gallop_left() returns the position immediately before the
leftmost equal value, and gallop_right() the position immediately after the
rightmost equal value.  The distinction is needed to preserve stability.  In
general, when merging adjacent runs A and B, gallop_left is used to search
thru B for where an element from A belongs, and gallop_right to search thru A
for where an element from B belongs.
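The distinction is easy to see with the bisect module itself (illustration only; the list `b` is made up):

```python
from bisect import bisect_left, bisect_right

b = [1, 3, 3, 3, 5]
# "gallop_left" flavor: position immediately before the leftmost 3.
left = bisect_left(b, 3)     # 1
# "gallop_right" flavor: position immediately after the rightmost 3.
right = bisect_right(b, 3)   # 4
```

Since A precedes B in the array, an element of A tied with elements of B must land before all of them (the left flavor, searching B), and an element of B tied with elements of A must land after all of them (the right flavor, searching A) -- that's exactly the stability requirement.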


REGION OF UNCERTAINTY
Two kinds of confusion seem to be common about the claim that after finding
a k such that

    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]

then a binary search requires exactly k-1 tries to find A[0]'s proper
location.  For concreteness, say k=3, so B[3] < A[0] <= B[7].

The first confusion takes the form "OK, then the region of uncertainty is at
indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
3"; or the region is viewed as a Python slice and the objection is "but that's
the slice B[3:7], so has 7-3 = 4 elements".  Resolution: we've already
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
already known wrt _both_ endpoints.  What remains is to find A[0]'s correct
location wrt B[4], B[5] and B[6], which spans 3 elements.  Or in general, the
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has

    (2**k-1)-1 - 2**(k-1) + 1 =
    2**k-1 - 2**(k-1)         =
    2*2**(k-1)-1 - 2**(k-1)   =
    (2-1)*2**(k-1) - 1        =
    2**(k-1) - 1

elements.
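The same arithmetic can be checked mechanically for a range of k:

```python
# Between the already-compared endpoints B[2**(k-1) - 1] and B[2**k - 1]
# lie exactly 2**(k-1) - 1 elements, and one more location than that.
for k in range(2, 20):
    left = 2 ** (k - 1) - 1       # endpoint already known to be < A[0]
    right = 2 ** k - 1            # endpoint already known to be >= A[0]
    uncertain = right - left - 1  # elements strictly between them
    assert uncertain == 2 ** (k - 1) - 1
    assert uncertain + 1 == 2 ** (k - 1)   # number of candidate locations
```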

The second confusion: "k-1 = 2 binary searches can find the correct location
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
could make this more efficient by arranging for the region of uncertainty to
span 2**(k-1) elements."  Resolution: that confuses "elements" with
"locations".  In a slice with N elements, there are N+1 _locations_.  In the
example, with the region of uncertainty B[4], B[5], B[6], there are 4
locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
after B[6].  In general, across 2**(k-1)-1 elements, there are 2**(k-1)
locations.  That's why k-1 binary searches are necessary and sufficient.

OPTIMIZATION OF INDIVIDUAL COMPARISONS
As noted above, even the simplest Python comparison triggers a large pile of
C-level pointer dereferences, conditionals, and function calls.  This can be
partially mitigated by pre-scanning the data to determine whether the data is
homogeneous with respect to type.  If so, it is sometimes possible to
substitute faster type-specific comparisons for the slower, generic
PyObject_RichCompareBool.
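The pre-scan idea, expressed in Python rather than C (a hypothetical sketch; the function name is my own, and the real check happens in C before sorting begins):

```python
# One pass over the data to test whether every element has the same
# exact type; only then may a type-specific compare be substituted.
def is_type_homogeneous(xs):
    return bool(xs) and all(type(x) is type(xs[0]) for x in xs[1:])
```

If the scan succeeds for, say, a list of ints, every subsequent compare in the sort can skip the generic dispatch and go straight to a type-specific routine, amortizing the single O(n) scan over the O(n log n) compares.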