17db96d56Sopenharmony_ciThis document explains Crochemore and Perrin's Two-Way string matching 27db96d56Sopenharmony_cialgorithm, in which a smaller string (the "pattern" or "needle") 37db96d56Sopenharmony_ciis searched for in a longer string (the "text" or "haystack"), 47db96d56Sopenharmony_cidetermining whether the needle is a substring of the haystack, and if 57db96d56Sopenharmony_ciso, at what index(es). It is to be used by Python's string 67db96d56Sopenharmony_ci(and bytes-like) objects when calling `find`, `index`, `__contains__`, 77db96d56Sopenharmony_cior implicitly in methods like `replace` or `partition`. 87db96d56Sopenharmony_ci 97db96d56Sopenharmony_ciThis is essentially a re-telling of the paper 107db96d56Sopenharmony_ci 117db96d56Sopenharmony_ci Crochemore M., Perrin D., 1991, Two-way string-matching, 127db96d56Sopenharmony_ci Journal of the ACM 38(3):651-675. 137db96d56Sopenharmony_ci 147db96d56Sopenharmony_cifocused more on understanding and examples than on rigor. See also 157db96d56Sopenharmony_cithe code sample here: 167db96d56Sopenharmony_ci 177db96d56Sopenharmony_ci http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 187db96d56Sopenharmony_ci 197db96d56Sopenharmony_ciThe algorithm runs in O(len(needle) + len(haystack)) time and with 207db96d56Sopenharmony_ciO(1) space. However, since there is a larger preprocessing cost than 217db96d56Sopenharmony_cisimpler algorithms, this Two-Way algorithm is to be used only when the 227db96d56Sopenharmony_cineedle and haystack lengths meet certain thresholds. 237db96d56Sopenharmony_ci 247db96d56Sopenharmony_ci 257db96d56Sopenharmony_ciThese are the basic steps of the algorithm: 267db96d56Sopenharmony_ci 277db96d56Sopenharmony_ci * "Very carefully" cut the needle in two. 287db96d56Sopenharmony_ci * For each alignment attempted: 297db96d56Sopenharmony_ci 1. match the right part 307db96d56Sopenharmony_ci * On failure, jump by the amount matched + 1 317db96d56Sopenharmony_ci 2. then match the left part. 327db96d56Sopenharmony_ci * On failure jump by max(len(left), len(right)) + 1 337db96d56Sopenharmony_ci * If the needle is periodic, don't re-do comparisons; maintain 347db96d56Sopenharmony_ci a "memory" of how many characters you already know match. 357db96d56Sopenharmony_ci 367db96d56Sopenharmony_ci 377db96d56Sopenharmony_ci-------- Matching the right part -------- 387db96d56Sopenharmony_ci 397db96d56Sopenharmony_ciWe first scan the right part of the needle to check if it matches the 407db96d56Sopenharmony_cithe aligned characters in the haystack. We scan left-to-right, 417db96d56Sopenharmony_ciand if a mismatch occurs, we jump ahead by the amount matched plus 1. 427db96d56Sopenharmony_ci 437db96d56Sopenharmony_ciExample: 447db96d56Sopenharmony_ci 457db96d56Sopenharmony_ci text: ........EFGX................... 467db96d56Sopenharmony_ci pattern: ....abcdEFGH.... 477db96d56Sopenharmony_ci cut: <<<<>>>> 487db96d56Sopenharmony_ci 497db96d56Sopenharmony_ciMatched 3, so jump ahead by 4: 507db96d56Sopenharmony_ci 517db96d56Sopenharmony_ci text: ........EFGX................... 527db96d56Sopenharmony_ci pattern: ....abcdEFGH.... 537db96d56Sopenharmony_ci cut: <<<<>>>> 547db96d56Sopenharmony_ci 557db96d56Sopenharmony_ciWhy are we allowed to do this? Because we cut the needle very 567db96d56Sopenharmony_cicarefully, in such a way that if the cut is ...abcd + EFGH... then 577db96d56Sopenharmony_ciwe have 587db96d56Sopenharmony_ci 597db96d56Sopenharmony_ci d != E 607db96d56Sopenharmony_ci cd != EF 617db96d56Sopenharmony_ci bcd != EFG 627db96d56Sopenharmony_ci abcd != EFGH 637db96d56Sopenharmony_ci ... and so on. 647db96d56Sopenharmony_ci 657db96d56Sopenharmony_ciIf this is true for every pair of equal-length substrings around the 667db96d56Sopenharmony_cicut, then the following alignments do not work, so we can skip them: 677db96d56Sopenharmony_ci 687db96d56Sopenharmony_ci text: ........EFG.................... 697db96d56Sopenharmony_ci pattern: ....abcdEFGH.... 707db96d56Sopenharmony_ci ^ (Bad because d != E) 717db96d56Sopenharmony_ci text: ........EFG.................... 727db96d56Sopenharmony_ci pattern: ....abcdEFGH.... 737db96d56Sopenharmony_ci ^^ (Bad because cd != EF) 747db96d56Sopenharmony_ci text: ........EFG.................... 757db96d56Sopenharmony_ci pattern: ....abcdEFGH.... 767db96d56Sopenharmony_ci ^^^ (Bad because bcd != EFG) 777db96d56Sopenharmony_ci 787db96d56Sopenharmony_ciSkip 3 alignments => increment alignment by 4. 797db96d56Sopenharmony_ci 807db96d56Sopenharmony_ci 817db96d56Sopenharmony_ci-------- If len(left_part) < len(right_part) -------- 827db96d56Sopenharmony_ci 837db96d56Sopenharmony_ciAbove is the core idea, and it begins to suggest how the algorithm can 847db96d56Sopenharmony_cibe linear-time. There is one bit of subtlety involving what to do 857db96d56Sopenharmony_ciaround the end of the needle: if the left half is shorter than the 867db96d56Sopenharmony_ciright, then we could run into something like this: 877db96d56Sopenharmony_ci 887db96d56Sopenharmony_ci text: .....EFG...... 897db96d56Sopenharmony_ci pattern: cdEFGH 907db96d56Sopenharmony_ci 917db96d56Sopenharmony_ciThe same argument holds that we can skip ahead by 4, so long as 927db96d56Sopenharmony_ci 937db96d56Sopenharmony_ci d != E 947db96d56Sopenharmony_ci cd != EF 957db96d56Sopenharmony_ci ?cd != EFG 967db96d56Sopenharmony_ci ??cd != EFGH 977db96d56Sopenharmony_ci etc. 987db96d56Sopenharmony_ci 997db96d56Sopenharmony_ciThe question marks represent "wildcards" that always match; they're 1007db96d56Sopenharmony_cioutside the limits of the needle, so there's no way for them to 1017db96d56Sopenharmony_ciinvalidate a match. To ensure that the inequalities above are always 1027db96d56Sopenharmony_citrue, we need them to be true for all possible '?' values. We thus 1037db96d56Sopenharmony_cineed cd != FG and cd != GH, etc. 1047db96d56Sopenharmony_ci 1057db96d56Sopenharmony_ci 1067db96d56Sopenharmony_ci-------- Matching the left part -------- 1077db96d56Sopenharmony_ci 1087db96d56Sopenharmony_ciOnce we have ensured the right part matches, we scan the left part 1097db96d56Sopenharmony_ci(order doesn't matter, but traditionally right-to-left), and if we 1107db96d56Sopenharmony_cifind a mismatch, we jump ahead by 1117db96d56Sopenharmony_cimax(len(left_part), len(right_part)) + 1. That we can jump by 1127db96d56Sopenharmony_ciat least len(right_part) + 1 we have already seen: 1137db96d56Sopenharmony_ci 1147db96d56Sopenharmony_ci text: .....EFG..... 1157db96d56Sopenharmony_ci pattern: abcdEFG 1167db96d56Sopenharmony_ci Matched 3, so jump by 4, 1177db96d56Sopenharmony_ci using the fact that d != E, cd != EF, and bcd != EFG. 1187db96d56Sopenharmony_ci 1197db96d56Sopenharmony_ciBut we can also jump by at least len(left_part) + 1: 1207db96d56Sopenharmony_ci 1217db96d56Sopenharmony_ci text: ....cdEF..... 1227db96d56Sopenharmony_ci pattern: abcdEF 1237db96d56Sopenharmony_ci Jump by len('abcd') + 1 = 5. 1247db96d56Sopenharmony_ci 1257db96d56Sopenharmony_ci Skip the alignments: 1267db96d56Sopenharmony_ci text: ....cdEF..... 1277db96d56Sopenharmony_ci pattern: abcdEF 1287db96d56Sopenharmony_ci text: ....cdEF..... 1297db96d56Sopenharmony_ci pattern: abcdEF 1307db96d56Sopenharmony_ci text: ....cdEF..... 1317db96d56Sopenharmony_ci pattern: abcdEF 1327db96d56Sopenharmony_ci text: ....cdEF..... 1337db96d56Sopenharmony_ci pattern: abcdEF 1347db96d56Sopenharmony_ci 1357db96d56Sopenharmony_ciThis requires the following facts: 1367db96d56Sopenharmony_ci d != E 1377db96d56Sopenharmony_ci cd != EF 1387db96d56Sopenharmony_ci bcd != EF? 1397db96d56Sopenharmony_ci abcd != EF?? 1407db96d56Sopenharmony_ci etc., for all values of ?s, as above. 1417db96d56Sopenharmony_ci 1427db96d56Sopenharmony_ciIf we have both sets of inequalities, then we can indeed jump by 1437db96d56Sopenharmony_cimax(len(left_part), len(right_part)) + 1. Under the assumption of such 1447db96d56Sopenharmony_cia nice splitting of the needle, we now have enough to prove linear 1457db96d56Sopenharmony_citime for the search: consider the forward-progress/comparisons ratio 1467db96d56Sopenharmony_ciat each alignment position. If a mismatch occurs in the right part, 1477db96d56Sopenharmony_cithe ratio is 1 position forward per comparison. On the other hand, 1487db96d56Sopenharmony_ciif a mismatch occurs in the left half, we advance by more than 1497db96d56Sopenharmony_cilen(needle)//2 positions for at most len(needle) comparisons, 1507db96d56Sopenharmony_ciso this ratio is more than 1/2. This average "movement speed" is 1517db96d56Sopenharmony_cibounded below by the constant "1 position per 2 comparisons", so we 1527db96d56Sopenharmony_cihave linear time. 1537db96d56Sopenharmony_ci 1547db96d56Sopenharmony_ci 1557db96d56Sopenharmony_ci-------- The periodic case -------- 1567db96d56Sopenharmony_ci 1577db96d56Sopenharmony_ciThe sets of inequalities listed so far seem too good to be true in 1587db96d56Sopenharmony_cithe general case. Indeed, they fail when a needle is periodic: 1597db96d56Sopenharmony_cithere's no way to split 'AAbAAbAAbA' in two such that 1607db96d56Sopenharmony_ci 1617db96d56Sopenharmony_ci (the stuff n characters to the left of the split) 1627db96d56Sopenharmony_ci cannot equal 1637db96d56Sopenharmony_ci (the stuff n characters to the right of the split) 1647db96d56Sopenharmony_ci for all n. 1657db96d56Sopenharmony_ci 1667db96d56Sopenharmony_ciThis is because no matter how you cut it, you'll get 1677db96d56Sopenharmony_cis[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the 1687db96d56Sopenharmony_cineedle in two so that n can be as big as possible. If we were to 1697db96d56Sopenharmony_cisplit it as 1707db96d56Sopenharmony_ci 1717db96d56Sopenharmony_ci AAbA + AbAAbA 1727db96d56Sopenharmony_ci 1737db96d56Sopenharmony_cithen A == A at the split, so this is bad (we failed at length 1), but 1747db96d56Sopenharmony_ciif we split it as 1757db96d56Sopenharmony_ci 1767db96d56Sopenharmony_ci AA + bAAbAAbA 1777db96d56Sopenharmony_ci 1787db96d56Sopenharmony_ciwe at least have A != b and AA != bA, and we fail at length 3 1797db96d56Sopenharmony_cisince ?AA == bAA. We already knew that a cut to make length-3 1807db96d56Sopenharmony_cimismatch was impossible due to the period, but we now see that the 1817db96d56Sopenharmony_cibound is sharp; we can get length-1 and length-2 to mismatch. 1827db96d56Sopenharmony_ci 1837db96d56Sopenharmony_ciThis is exactly the content of the *critical factorization theorem*: 1847db96d56Sopenharmony_cithat no matter the period of the original needle, you can cut it in 1857db96d56Sopenharmony_cisuch a way that (with the appropriate question marks), 1867db96d56Sopenharmony_cineedle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period. 1877db96d56Sopenharmony_ci 1887db96d56Sopenharmony_ciEven "non-periodic" strings are periodic with a period equal to 1897db96d56Sopenharmony_citheir length, so for such needles, the CFT already guarantees that 1907db96d56Sopenharmony_cithe algorithm described so far will work, since we can cut the needle 1917db96d56Sopenharmony_ciso that the length-k chunks on either side of the cut mismatch for all 1927db96d56Sopenharmony_cik < len(needle). Looking closer at the algorithm, we only actually 1937db96d56Sopenharmony_cirequire that k go up to max(len(left_part), len(right_part)). 1947db96d56Sopenharmony_ciSo long as the period exceeds that, we're good. 1957db96d56Sopenharmony_ci 1967db96d56Sopenharmony_ciThe more general shorter-period case is a bit harder. The essentials 1977db96d56Sopenharmony_ciare the same, except we use the periodicity to our advantage by 1987db96d56Sopenharmony_ci"remembering" periods that we've already compared. In our running 1997db96d56Sopenharmony_ciexample, say we're computing 2007db96d56Sopenharmony_ci 2017db96d56Sopenharmony_ci "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA". 2027db96d56Sopenharmony_ci 2037db96d56Sopenharmony_ciWe cut as AA + bAAbAAbA, and then the algorithm runs as follows: 2047db96d56Sopenharmony_ci 2057db96d56Sopenharmony_ci First alignment: 2067db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2077db96d56Sopenharmony_ci AAbAAbAAbA 2087db96d56Sopenharmony_ci ^^X 2097db96d56Sopenharmony_ci - Mismatch at third position, so jump by 3. 2107db96d56Sopenharmony_ci - This requires that A!=b and AA != bA. 2117db96d56Sopenharmony_ci 2127db96d56Sopenharmony_ci Second alignment: 2137db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2147db96d56Sopenharmony_ci AAbAAbAAbA 2157db96d56Sopenharmony_ci ^^^^^^^^ 2167db96d56Sopenharmony_ci X 2177db96d56Sopenharmony_ci - Matched entire right part 2187db96d56Sopenharmony_ci - Mismatch at left part. 2197db96d56Sopenharmony_ci - Jump forward a period, remembering the existing comparisons 2207db96d56Sopenharmony_ci 2217db96d56Sopenharmony_ci Third alignment: 2227db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2237db96d56Sopenharmony_ci AAbAAbAAbA 2247db96d56Sopenharmony_ci mmmmmmm^^X 2257db96d56Sopenharmony_ci - There's "memory": a bunch of characters were already matched. 2267db96d56Sopenharmony_ci - Two more characters match beyond that. 2277db96d56Sopenharmony_ci - The 8th character of the right part mismatched, so jump by 8 2287db96d56Sopenharmony_ci - The above rule is more complicated than usual: we don't have 2297db96d56Sopenharmony_ci the right inequalities for lengths 1 through 7, but we do have 2307db96d56Sopenharmony_ci shifted copies of the length-1 and length-2 inequalities, 2317db96d56Sopenharmony_ci along with knowledge of the mismatch. We can skip all of these 2327db96d56Sopenharmony_ci alignments at once: 2337db96d56Sopenharmony_ci 2347db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2357db96d56Sopenharmony_ci AAbAAbAAbA 2367db96d56Sopenharmony_ci ~ A != b at the cut 2377db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2387db96d56Sopenharmony_ci AAbAAbAAbA 2397db96d56Sopenharmony_ci ~~ AA != bA at the cut 2407db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2417db96d56Sopenharmony_ci AAbAAbAAbA 2427db96d56Sopenharmony_ci ^^^^X 7-3=4 match, and the 5th misses. 2437db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2447db96d56Sopenharmony_ci AAbAAbAAbA 2457db96d56Sopenharmony_ci ~ A != b at the cut 2467db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2477db96d56Sopenharmony_ci AAbAAbAAbA 2487db96d56Sopenharmony_ci ~~ AA != bA at the cut 2497db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2507db96d56Sopenharmony_ci AAbAAbAAbA 2517db96d56Sopenharmony_ci ^X 7-3-3=1 match and the 2nd misses. 2527db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2537db96d56Sopenharmony_ci AAbAAbAAbA 2547db96d56Sopenharmony_ci ~ A != b at the cut 2557db96d56Sopenharmony_ci 2567db96d56Sopenharmony_ci Fourth alignment: 2577db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2587db96d56Sopenharmony_ci AAbAAbAAbA 2597db96d56Sopenharmony_ci ^X 2607db96d56Sopenharmony_ci - Second character mismatches, so jump by 2. 2617db96d56Sopenharmony_ci 2627db96d56Sopenharmony_ci Fifth alignment: 2637db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2647db96d56Sopenharmony_ci AAbAAbAAbA 2657db96d56Sopenharmony_ci ^^^^^^^^ 2667db96d56Sopenharmony_ci X 2677db96d56Sopenharmony_ci - Right half matches, so use memory and skip ahead by period=3 2687db96d56Sopenharmony_ci 2697db96d56Sopenharmony_ci Sixth alignment: 2707db96d56Sopenharmony_ci bbbAbbAAbAAbAAbbbAAbAAbAAbAA 2717db96d56Sopenharmony_ci AAbAAbAAbA 2727db96d56Sopenharmony_ci mmmmmmmm^^ 2737db96d56Sopenharmony_ci - Right part matches, left part is remembered, found a match! 2747db96d56Sopenharmony_ci 2757db96d56Sopenharmony_ciThe one tricky skip by 8 here generalizes: if we have a period of p, 2767db96d56Sopenharmony_cithen the CFT says we can ensure the cut has the inequality property 2777db96d56Sopenharmony_cifor lengths 1 through p-1, and jumping by p would line up the 2787db96d56Sopenharmony_cimatching characters and mismatched character one period earlier. 2797db96d56Sopenharmony_ciInductively, this proves that we can skip by the number of characters 2807db96d56Sopenharmony_cimatched in the right half, plus 1, just as in the original algorithm. 2817db96d56Sopenharmony_ci 2827db96d56Sopenharmony_ciTo make it explicit, the memory is set whenever the entire right part 2837db96d56Sopenharmony_ciis matched and is then used as a starting point in the next alignment. 2847db96d56Sopenharmony_ciIn such a case, the alignment jumps forward one period, and the right 2857db96d56Sopenharmony_cihalf matches all except possibly the last period. Additionally, 2867db96d56Sopenharmony_ciif we cut so that the left part has a length strictly less than the 2877db96d56Sopenharmony_ciperiod (we always can!), then we can know that the left part already 2887db96d56Sopenharmony_cimatches. The memory is reset to 0 whenever there is a mismatch in the 2897db96d56Sopenharmony_ciright part. 2907db96d56Sopenharmony_ci 2917db96d56Sopenharmony_ciTo prove linearity for the periodic case, note that if a right-part 2927db96d56Sopenharmony_cicharacter mismatches, then we advance forward 1 unit per comparison. 2937db96d56Sopenharmony_ciOn the other hand, if the entire right part matches, then the skipping 2947db96d56Sopenharmony_ciforward by one period "defers" some of the comparisons to the next 2957db96d56Sopenharmony_cialignment, where they will then be spent at the usual rate of 2967db96d56Sopenharmony_cione comparison per step forward. Even if left-half comparisons 2977db96d56Sopenharmony_ciare always "wasted", they constitute less than half of all 2987db96d56Sopenharmony_cicomparisons, so the average rate is certainly at least 1 move forward 2997db96d56Sopenharmony_ciper 2 comparisons. 3007db96d56Sopenharmony_ci 3017db96d56Sopenharmony_ci 3027db96d56Sopenharmony_ci-------- When to choose the periodic algorithm --------- 3037db96d56Sopenharmony_ci 3047db96d56Sopenharmony_ciThe periodic algorithm is always valid but has an overhead of one 3057db96d56Sopenharmony_cimore "memory" register and some memory computation steps, so the 3067db96d56Sopenharmony_cihere-described-first non-periodic/long-period algorithm -- skipping by 3077db96d56Sopenharmony_cimax(len(left_part), len(right_part)) + 1 rather than the period -- 3087db96d56Sopenharmony_cishould be preferred when possible. 3097db96d56Sopenharmony_ci 3107db96d56Sopenharmony_ciInterestingly, the long-period algorithm does not require an exact 3117db96d56Sopenharmony_cicomputation of the period; it works even with some long-period, but 3127db96d56Sopenharmony_ciundeniably "periodic" needles: 3137db96d56Sopenharmony_ci 3147db96d56Sopenharmony_ci Cut: AbcdefAbc == Abcde + fAbc 3157db96d56Sopenharmony_ci 3167db96d56Sopenharmony_ciThis cut gives these inequalities: 3177db96d56Sopenharmony_ci 3187db96d56Sopenharmony_ci e != f 3197db96d56Sopenharmony_ci de != fA 3207db96d56Sopenharmony_ci cde != fAb 3217db96d56Sopenharmony_ci bcde != fAbc 3227db96d56Sopenharmony_ci Abcde != fAbc? 3237db96d56Sopenharmony_ci The first failure is a period long, per the CFT: 3247db96d56Sopenharmony_ci ?Abcde == fAbc?? 3257db96d56Sopenharmony_ci 3267db96d56Sopenharmony_ciA sufficient condition for using the long-period algorithm is having 3277db96d56Sopenharmony_cithe period of the needle be greater than 3287db96d56Sopenharmony_cimax(len(left_part), len(right_part)). This way, after choosing a good 3297db96d56Sopenharmony_cisplit, we get all of the max(len(left_part), len(right_part)) 3307db96d56Sopenharmony_ciinequalities around the cut that were required in the long-period 3317db96d56Sopenharmony_civersion of the algorithm. 3327db96d56Sopenharmony_ci 3337db96d56Sopenharmony_ciWith all of this in mind, here's how we choose: 3347db96d56Sopenharmony_ci 3357db96d56Sopenharmony_ci (1) Choose a "critical factorization" of the needle -- a cut 3367db96d56Sopenharmony_ci where we have period minus 1 inequalities in a row. 3377db96d56Sopenharmony_ci More specifically, choose a cut so that the left_part 3387db96d56Sopenharmony_ci is less than one period long. 3397db96d56Sopenharmony_ci (2) Determine the period P_r of the right_part. 3407db96d56Sopenharmony_ci (3) Check if the left part is just an extension of the pattern of 3417db96d56Sopenharmony_ci the right part, so that the whole needle has period P_r. 3427db96d56Sopenharmony_ci Explicitly, check if 3437db96d56Sopenharmony_ci needle[0:cut] == needle[0+P_r:cut+P_r] 3447db96d56Sopenharmony_ci If so, we use the periodic algorithm. If not equal, we use the 3457db96d56Sopenharmony_ci long-period algorithm. 3467db96d56Sopenharmony_ci 3477db96d56Sopenharmony_ciNote that if equality holds in (3), then the period of the whole 3487db96d56Sopenharmony_cistring is P_r. On the other hand, suppose equality does not hold. 3497db96d56Sopenharmony_ciThe period of the needle is then strictly greater than P_r. Here's 3507db96d56Sopenharmony_cia general fact: 3517db96d56Sopenharmony_ci 3527db96d56Sopenharmony_ci If p is a substring of s and p has period r, then the period 3537db96d56Sopenharmony_ci of s is either equal to r or greater than len(p). 3547db96d56Sopenharmony_ci 3557db96d56Sopenharmony_ciWe know that needle_period != P_r, 3567db96d56Sopenharmony_ciand therefore needle_period > len(right_part). 3577db96d56Sopenharmony_ciAdditionally, we'll choose the cut (see below) 3587db96d56Sopenharmony_ciso that len(left_part) < needle_period. 3597db96d56Sopenharmony_ci 3607db96d56Sopenharmony_ciThus, in the case where equality does not hold, we have that 3617db96d56Sopenharmony_cineedle_period >= max(len(left_part), len(right_part)) + 1, 3627db96d56Sopenharmony_ciso the long-period algorithm works, but otherwise, we know the period 3637db96d56Sopenharmony_ciof the needle. 3647db96d56Sopenharmony_ci 3657db96d56Sopenharmony_ciNote that this decision process doesn't always require an exact 3667db96d56Sopenharmony_cicomputation of the period -- we can get away with only computing P_r! 3677db96d56Sopenharmony_ci 3687db96d56Sopenharmony_ci 3697db96d56Sopenharmony_ci-------- Computing the cut -------- 3707db96d56Sopenharmony_ci 3717db96d56Sopenharmony_ciOur remaining tasks are now to compute a cut of the needle with as 3727db96d56Sopenharmony_cimany inequalities as possible, ensuring that cut < needle_period. 3737db96d56Sopenharmony_ciMeanwhile, we must also compute the period P_r of the right_part. 3747db96d56Sopenharmony_ci 3757db96d56Sopenharmony_ciThe computation is relatively simple, essentially doing this: 3767db96d56Sopenharmony_ci 3777db96d56Sopenharmony_ci suffix1 = max(needle[i:] for i in range(len(needle))) 3787db96d56Sopenharmony_ci suffix2 = ... # the same as above, but invert the alphabet 3797db96d56Sopenharmony_ci cut1 = len(needle) - len(suffix1) 3807db96d56Sopenharmony_ci cut2 = len(needle) - len(suffix2) 3817db96d56Sopenharmony_ci cut = max(cut1, cut2) # the later cut 3827db96d56Sopenharmony_ci 3837db96d56Sopenharmony_ciFor cut2, "invert the alphabet" is different than saying min(...), 3847db96d56Sopenharmony_cisince in lexicographic order, we still put "py" < "python", even 3857db96d56Sopenharmony_ciif the alphabet is inverted. Computing these, along with the method 3867db96d56Sopenharmony_ciof computing the period of the right half, is easiest to read directly 3877db96d56Sopenharmony_cifrom the source code in fastsearch.h, in which these are computed 3887db96d56Sopenharmony_ciin linear time. 3897db96d56Sopenharmony_ci 3907db96d56Sopenharmony_ciCrochemore & Perrin's Theorem 3.1 give that "cut" above is a 3917db96d56Sopenharmony_cicritical factorization less than the period, but a very brief sketch 3927db96d56Sopenharmony_ciof their proof goes something like this (this is far from complete): 3937db96d56Sopenharmony_ci 3947db96d56Sopenharmony_ci * If this cut splits the needle as some 3957db96d56Sopenharmony_ci needle == (a + w) + (w + b), meaning there's a bad equality 3967db96d56Sopenharmony_ci w == w, it's impossible for w + b to be bigger than both 3977db96d56Sopenharmony_ci b and w + w + b, so this can't happen. We thus have all of 3987db96d56Sopenharmony_ci the ineuqalities with no question marks. 3997db96d56Sopenharmony_ci * By maximality, the right part is not a substring of the left 4007db96d56Sopenharmony_ci part. Thus, we have all of the inequalities involving no 4017db96d56Sopenharmony_ci left-side question marks. 4027db96d56Sopenharmony_ci * If you have all of the inequalities without right-side question 4037db96d56Sopenharmony_ci marks, we have a critical factorization. 4047db96d56Sopenharmony_ci * If one such inequality fails, then there's a smaller period, 4057db96d56Sopenharmony_ci but the factorization is nonetheless critical. Here's where 4067db96d56Sopenharmony_ci you need the redundancy coming from computing both cuts and 4077db96d56Sopenharmony_ci choosing the later one. 4087db96d56Sopenharmony_ci 4097db96d56Sopenharmony_ci 4107db96d56Sopenharmony_ci-------- Some more Bells and Whistles -------- 4117db96d56Sopenharmony_ci 4127db96d56Sopenharmony_ciBeyond Crochemore & Perrin's original algorithm, we can use a couple 4137db96d56Sopenharmony_cimore tricks for speed in fastsearch.h: 4147db96d56Sopenharmony_ci 4157db96d56Sopenharmony_ci 1. Even though C&P has a best-case O(n/m) time, this doesn't occur 4167db96d56Sopenharmony_ci very often, so we add a Boyer-Moore bad character table to 4177db96d56Sopenharmony_ci achieve sublinear time in more cases. 4187db96d56Sopenharmony_ci 4197db96d56Sopenharmony_ci 2. The prework of computing the cut/period is expensive per 4207db96d56Sopenharmony_ci needle character, so we shouldn't do it if it won't pay off. 4217db96d56Sopenharmony_ci For this reason, if the needle and haystack are long enough, 4227db96d56Sopenharmony_ci only automatically start with two-way if the needle's length 4237db96d56Sopenharmony_ci is a small percentage of the length of the haystack. 4247db96d56Sopenharmony_ci 4257db96d56Sopenharmony_ci 3. In cases where the needle and haystack are large but the needle 4267db96d56Sopenharmony_ci makes up a significant percentage of the length of the 4277db96d56Sopenharmony_ci haystack, don't pay the expensive two-way preprocessing cost 4287db96d56Sopenharmony_ci if you don't need to. Instead, keep track of how many 4297db96d56Sopenharmony_ci character comparisons are equal, and if that exceeds 4307db96d56Sopenharmony_ci O(len(needle)), then pay that cost, since the simpler algorithm 4317db96d56Sopenharmony_ci isn't doing very well. 432