17db96d56Sopenharmony_ciThis document explains Crochemore and Perrin's Two-Way string matching
27db96d56Sopenharmony_cialgorithm, in which a smaller string (the "pattern" or "needle")
37db96d56Sopenharmony_ciis searched for in a longer string (the "text" or "haystack"),
47db96d56Sopenharmony_cidetermining whether the needle is a substring of the haystack, and if
57db96d56Sopenharmony_ciso, at what index(es). It is to be used by Python's string
67db96d56Sopenharmony_ci(and bytes-like) objects when calling `find`, `index`, `__contains__`,
77db96d56Sopenharmony_cior implicitly in methods like `replace` or `partition`.
87db96d56Sopenharmony_ci
97db96d56Sopenharmony_ciThis is essentially a re-telling of the paper
107db96d56Sopenharmony_ci
117db96d56Sopenharmony_ci    Crochemore M., Perrin D., 1991, Two-way string-matching,
127db96d56Sopenharmony_ci        Journal of the ACM 38(3):651-675.
137db96d56Sopenharmony_ci
147db96d56Sopenharmony_cifocused more on understanding and examples than on rigor. See also
157db96d56Sopenharmony_cithe code sample here:
167db96d56Sopenharmony_ci
177db96d56Sopenharmony_ci    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
187db96d56Sopenharmony_ci
197db96d56Sopenharmony_ciThe algorithm runs in O(len(needle) + len(haystack)) time and with
207db96d56Sopenharmony_ciO(1) space. However, since there is a larger preprocessing cost than
217db96d56Sopenharmony_cisimpler algorithms, this Two-Way algorithm is to be used only when the
227db96d56Sopenharmony_cineedle and haystack lengths meet certain thresholds.
237db96d56Sopenharmony_ci
247db96d56Sopenharmony_ci
257db96d56Sopenharmony_ciThese are the basic steps of the algorithm:
267db96d56Sopenharmony_ci
277db96d56Sopenharmony_ci    * "Very carefully" cut the needle in two.
287db96d56Sopenharmony_ci    * For each alignment attempted:
297db96d56Sopenharmony_ci        1. match the right part
307db96d56Sopenharmony_ci            * On failure, jump by the amount matched + 1
317db96d56Sopenharmony_ci        2. then match the left part.
327db96d56Sopenharmony_ci            * On failure jump by max(len(left), len(right)) + 1
337db96d56Sopenharmony_ci    * If the needle is periodic, don't re-do comparisons; maintain
347db96d56Sopenharmony_ci      a "memory" of how many characters you already know match.
357db96d56Sopenharmony_ci
367db96d56Sopenharmony_ci
377db96d56Sopenharmony_ci-------- Matching the right part --------
387db96d56Sopenharmony_ci
397db96d56Sopenharmony_ciWe first scan the right part of the needle to check if it matches the
407db96d56Sopenharmony_cithe aligned characters in the haystack. We scan left-to-right,
417db96d56Sopenharmony_ciand if a mismatch occurs, we jump ahead by the amount matched plus 1.
427db96d56Sopenharmony_ci
437db96d56Sopenharmony_ciExample:
447db96d56Sopenharmony_ci
457db96d56Sopenharmony_ci       text:    ........EFGX...................
467db96d56Sopenharmony_ci    pattern:    ....abcdEFGH....
477db96d56Sopenharmony_ci        cut:        <<<<>>>>
487db96d56Sopenharmony_ci
497db96d56Sopenharmony_ciMatched 3, so jump ahead by 4:
507db96d56Sopenharmony_ci
517db96d56Sopenharmony_ci       text:    ........EFGX...................
527db96d56Sopenharmony_ci    pattern:        ....abcdEFGH....
537db96d56Sopenharmony_ci        cut:            <<<<>>>>
547db96d56Sopenharmony_ci
557db96d56Sopenharmony_ciWhy are we allowed to do this? Because we cut the needle very
567db96d56Sopenharmony_cicarefully, in such a way that if the cut is ...abcd + EFGH... then
577db96d56Sopenharmony_ciwe have
587db96d56Sopenharmony_ci
597db96d56Sopenharmony_ci        d != E
607db96d56Sopenharmony_ci       cd != EF
617db96d56Sopenharmony_ci      bcd != EFG
627db96d56Sopenharmony_ci     abcd != EFGH
637db96d56Sopenharmony_ci          ... and so on.
647db96d56Sopenharmony_ci
657db96d56Sopenharmony_ciIf this is true for every pair of equal-length substrings around the
667db96d56Sopenharmony_cicut, then the following alignments do not work, so we can skip them:
677db96d56Sopenharmony_ci
687db96d56Sopenharmony_ci       text:    ........EFG....................
697db96d56Sopenharmony_ci    pattern:     ....abcdEFGH....
707db96d56Sopenharmony_ci                        ^   (Bad because d != E)
717db96d56Sopenharmony_ci       text:    ........EFG....................
727db96d56Sopenharmony_ci    pattern:      ....abcdEFGH....
737db96d56Sopenharmony_ci                        ^^   (Bad because cd != EF)
747db96d56Sopenharmony_ci       text:    ........EFG....................
757db96d56Sopenharmony_ci    pattern:       ....abcdEFGH....
767db96d56Sopenharmony_ci                        ^^^   (Bad because bcd != EFG)
777db96d56Sopenharmony_ci
787db96d56Sopenharmony_ciSkip 3 alignments => increment alignment by 4.
797db96d56Sopenharmony_ci
807db96d56Sopenharmony_ci
817db96d56Sopenharmony_ci-------- If len(left_part) < len(right_part) --------
827db96d56Sopenharmony_ci
837db96d56Sopenharmony_ciAbove is the core idea, and it begins to suggest how the algorithm can
847db96d56Sopenharmony_cibe linear-time. There is one bit of subtlety involving what to do
857db96d56Sopenharmony_ciaround the end of the needle: if the left half is shorter than the
867db96d56Sopenharmony_ciright, then we could run into something like this:
877db96d56Sopenharmony_ci
887db96d56Sopenharmony_ci       text:    .....EFG......
897db96d56Sopenharmony_ci    pattern:       cdEFGH
907db96d56Sopenharmony_ci
917db96d56Sopenharmony_ciThe same argument holds that we can skip ahead by 4, so long as
927db96d56Sopenharmony_ci
937db96d56Sopenharmony_ci       d != E
947db96d56Sopenharmony_ci      cd != EF
957db96d56Sopenharmony_ci     ?cd != EFG
967db96d56Sopenharmony_ci    ??cd != EFGH
977db96d56Sopenharmony_ci         etc.
987db96d56Sopenharmony_ci
997db96d56Sopenharmony_ciThe question marks represent "wildcards" that always match; they're
1007db96d56Sopenharmony_cioutside the limits of the needle, so there's no way for them to
1017db96d56Sopenharmony_ciinvalidate a match. To ensure that the inequalities above are always
1027db96d56Sopenharmony_citrue, we need them to be true for all possible '?' values. We thus
1037db96d56Sopenharmony_cineed cd != FG and cd != GH, etc.
1047db96d56Sopenharmony_ci
1057db96d56Sopenharmony_ci
1067db96d56Sopenharmony_ci-------- Matching the left part --------
1077db96d56Sopenharmony_ci
1087db96d56Sopenharmony_ciOnce we have ensured the right part matches, we scan the left part
1097db96d56Sopenharmony_ci(order doesn't matter, but traditionally right-to-left), and if we
1107db96d56Sopenharmony_cifind a mismatch, we jump ahead by
1117db96d56Sopenharmony_cimax(len(left_part), len(right_part)) + 1. That we can jump by
1127db96d56Sopenharmony_ciat least len(right_part) + 1 we have already seen:
1137db96d56Sopenharmony_ci
1147db96d56Sopenharmony_ci       text: .....EFG.....
1157db96d56Sopenharmony_ci    pattern:  abcdEFG
1167db96d56Sopenharmony_ci    Matched 3, so jump by 4,
1177db96d56Sopenharmony_ci    using the fact that d != E, cd != EF, and bcd != EFG.
1187db96d56Sopenharmony_ci
1197db96d56Sopenharmony_ciBut we can also jump by at least len(left_part) + 1:
1207db96d56Sopenharmony_ci
1217db96d56Sopenharmony_ci       text: ....cdEF.....
1227db96d56Sopenharmony_ci    pattern:   abcdEF
1237db96d56Sopenharmony_ci    Jump by len('abcd') + 1 = 5.
1247db96d56Sopenharmony_ci
1257db96d56Sopenharmony_ci    Skip the alignments:
1267db96d56Sopenharmony_ci       text: ....cdEF.....
1277db96d56Sopenharmony_ci    pattern:    abcdEF
1287db96d56Sopenharmony_ci       text: ....cdEF.....
1297db96d56Sopenharmony_ci    pattern:     abcdEF
1307db96d56Sopenharmony_ci       text: ....cdEF.....
1317db96d56Sopenharmony_ci    pattern:      abcdEF
1327db96d56Sopenharmony_ci       text: ....cdEF.....
1337db96d56Sopenharmony_ci    pattern:       abcdEF
1347db96d56Sopenharmony_ci
1357db96d56Sopenharmony_ciThis requires the following facts:
1367db96d56Sopenharmony_ci       d != E
1377db96d56Sopenharmony_ci      cd != EF
1387db96d56Sopenharmony_ci     bcd != EF?
1397db96d56Sopenharmony_ci    abcd != EF??
1407db96d56Sopenharmony_ci         etc., for all values of ?s, as above.
1417db96d56Sopenharmony_ci
1427db96d56Sopenharmony_ciIf we have both sets of inequalities, then we can indeed jump by
1437db96d56Sopenharmony_cimax(len(left_part), len(right_part)) + 1. Under the assumption of such
1447db96d56Sopenharmony_cia nice splitting of the needle, we now have enough to prove linear
1457db96d56Sopenharmony_citime for the search: consider the forward-progress/comparisons ratio
1467db96d56Sopenharmony_ciat each alignment position. If a mismatch occurs in the right part,
1477db96d56Sopenharmony_cithe ratio is 1 position forward per comparison. On the other hand,
1487db96d56Sopenharmony_ciif a mismatch occurs in the left half, we advance by more than
1497db96d56Sopenharmony_cilen(needle)//2 positions for at most len(needle) comparisons,
1507db96d56Sopenharmony_ciso this ratio is more than 1/2. This average "movement speed" is
1517db96d56Sopenharmony_cibounded below by the constant "1 position per 2 comparisons", so we
1527db96d56Sopenharmony_cihave linear time.
1537db96d56Sopenharmony_ci
1547db96d56Sopenharmony_ci
1557db96d56Sopenharmony_ci-------- The periodic case --------
1567db96d56Sopenharmony_ci
1577db96d56Sopenharmony_ciThe sets of inequalities listed so far seem too good to be true in
1587db96d56Sopenharmony_cithe general case. Indeed, they fail when a needle is periodic:
1597db96d56Sopenharmony_cithere's no way to split 'AAbAAbAAbA' in two such that
1607db96d56Sopenharmony_ci
1617db96d56Sopenharmony_ci    (the stuff n characters to the left of the split)
1627db96d56Sopenharmony_ci    cannot equal
1637db96d56Sopenharmony_ci    (the stuff n characters to the right of the split)
1647db96d56Sopenharmony_ci    for all n.
1657db96d56Sopenharmony_ci
1667db96d56Sopenharmony_ciThis is because no matter how you cut it, you'll get
1677db96d56Sopenharmony_cis[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
1687db96d56Sopenharmony_cineedle in two so that n can be as big as possible. If we were to
1697db96d56Sopenharmony_cisplit it as
1707db96d56Sopenharmony_ci
1717db96d56Sopenharmony_ci    AAbA + AbAAbA
1727db96d56Sopenharmony_ci
1737db96d56Sopenharmony_cithen A == A at the split, so this is bad (we failed at length 1), but
1747db96d56Sopenharmony_ciif we split it as
1757db96d56Sopenharmony_ci
1767db96d56Sopenharmony_ci    AA + bAAbAAbA
1777db96d56Sopenharmony_ci
1787db96d56Sopenharmony_ciwe at least have A != b and AA != bA, and we fail at length 3
1797db96d56Sopenharmony_cisince ?AA == bAA. We already knew that a cut to make length-3
1807db96d56Sopenharmony_cimismatch was impossible due to the period, but we now see that the
1817db96d56Sopenharmony_cibound is sharp; we can get length-1 and length-2 to mismatch.
1827db96d56Sopenharmony_ci
1837db96d56Sopenharmony_ciThis is exactly the content of the *critical factorization theorem*:
1847db96d56Sopenharmony_cithat no matter the period of the original needle, you can cut it in
1857db96d56Sopenharmony_cisuch a way that (with the appropriate question marks),
1867db96d56Sopenharmony_cineedle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.
1877db96d56Sopenharmony_ci
1887db96d56Sopenharmony_ciEven "non-periodic" strings are periodic with a period equal to
1897db96d56Sopenharmony_citheir length, so for such needles, the CFT already guarantees that
1907db96d56Sopenharmony_cithe algorithm described so far will work, since we can cut the needle
1917db96d56Sopenharmony_ciso that the length-k chunks on either side of the cut mismatch for all
1927db96d56Sopenharmony_cik < len(needle). Looking closer at the algorithm, we only actually
1937db96d56Sopenharmony_cirequire that k go up to max(len(left_part), len(right_part)).
1947db96d56Sopenharmony_ciSo long as the period exceeds that, we're good.
1957db96d56Sopenharmony_ci
1967db96d56Sopenharmony_ciThe more general shorter-period case is a bit harder. The essentials
1977db96d56Sopenharmony_ciare the same, except we use the periodicity to our advantage by
1987db96d56Sopenharmony_ci"remembering" periods that we've already compared. In our running
1997db96d56Sopenharmony_ciexample, say we're computing
2007db96d56Sopenharmony_ci
2017db96d56Sopenharmony_ci    "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".
2027db96d56Sopenharmony_ci
2037db96d56Sopenharmony_ciWe cut as AA + bAAbAAbA, and then the algorithm runs as follows:
2047db96d56Sopenharmony_ci
2057db96d56Sopenharmony_ci    First alignment:
2067db96d56Sopenharmony_ci    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2077db96d56Sopenharmony_ci    AAbAAbAAbA
2087db96d56Sopenharmony_ci      ^^X
2097db96d56Sopenharmony_ci    - Mismatch at third position, so jump by 3.
2107db96d56Sopenharmony_ci    - This requires that A!=b and AA != bA.
2117db96d56Sopenharmony_ci
2127db96d56Sopenharmony_ci    Second alignment:
2137db96d56Sopenharmony_ci    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2147db96d56Sopenharmony_ci       AAbAAbAAbA
2157db96d56Sopenharmony_ci         ^^^^^^^^
2167db96d56Sopenharmony_ci        X
2177db96d56Sopenharmony_ci    - Matched entire right part
2187db96d56Sopenharmony_ci    - Mismatch at left part.
2197db96d56Sopenharmony_ci    - Jump forward a period, remembering the existing comparisons
2207db96d56Sopenharmony_ci
2217db96d56Sopenharmony_ci    Third alignment:
2227db96d56Sopenharmony_ci    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2237db96d56Sopenharmony_ci          AAbAAbAAbA
2247db96d56Sopenharmony_ci          mmmmmmm^^X
2257db96d56Sopenharmony_ci    - There's "memory": a bunch of characters were already matched.
2267db96d56Sopenharmony_ci    - Two more characters match beyond that.
2277db96d56Sopenharmony_ci    - The 8th character of the right part mismatched, so jump by 8
2287db96d56Sopenharmony_ci    - The above rule is more complicated than usual: we don't have
2297db96d56Sopenharmony_ci      the right inequalities for lengths 1 through 7, but we do have
2307db96d56Sopenharmony_ci      shifted copies of the length-1 and length-2 inequalities,
2317db96d56Sopenharmony_ci      along with knowledge of the mismatch. We can skip all of these
2327db96d56Sopenharmony_ci      alignments at once:
2337db96d56Sopenharmony_ci
2347db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2357db96d56Sopenharmony_ci               AAbAAbAAbA
2367db96d56Sopenharmony_ci                ~                   A != b at the cut
2377db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2387db96d56Sopenharmony_ci                AAbAAbAAbA
2397db96d56Sopenharmony_ci                ~~                  AA != bA at the cut
2407db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2417db96d56Sopenharmony_ci                 AAbAAbAAbA
2427db96d56Sopenharmony_ci                 ^^^^X              7-3=4 match, and the 5th misses.
2437db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2447db96d56Sopenharmony_ci                  AAbAAbAAbA
2457db96d56Sopenharmony_ci                   ~                A != b at the cut
2467db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2477db96d56Sopenharmony_ci                   AAbAAbAAbA
2487db96d56Sopenharmony_ci                   ~~               AA != bA at the cut
2497db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2507db96d56Sopenharmony_ci                    AAbAAbAAbA
2517db96d56Sopenharmony_ci                      ^X            7-3-3=1 match and the 2nd misses.
2527db96d56Sopenharmony_ci        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2537db96d56Sopenharmony_ci                     AAbAAbAAbA
2547db96d56Sopenharmony_ci                      ~             A != b at the cut
2557db96d56Sopenharmony_ci
2567db96d56Sopenharmony_ci    Fourth alignment:
2577db96d56Sopenharmony_ci    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2587db96d56Sopenharmony_ci                 AAbAAbAAbA
2597db96d56Sopenharmony_ci                   ^X
2607db96d56Sopenharmony_ci    - Second character mismatches, so jump by 2.
2617db96d56Sopenharmony_ci
2627db96d56Sopenharmony_ci    Fifth alignment:
2637db96d56Sopenharmony_ci    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2647db96d56Sopenharmony_ci                  AAbAAbAAbA
2657db96d56Sopenharmony_ci                    ^^^^^^^^
2667db96d56Sopenharmony_ci                   X
2677db96d56Sopenharmony_ci    - Right half matches, so use memory and skip ahead by period=3
2687db96d56Sopenharmony_ci
2697db96d56Sopenharmony_ci    Sixth alignment:
2707db96d56Sopenharmony_ci    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
2717db96d56Sopenharmony_ci                     AAbAAbAAbA
2727db96d56Sopenharmony_ci                     mmmmmmmm^^
2737db96d56Sopenharmony_ci    - Right part matches, left part is remembered, found a match!
2747db96d56Sopenharmony_ci
2757db96d56Sopenharmony_ciThe one tricky skip by 8 here generalizes: if we have a period of p,
2767db96d56Sopenharmony_cithen the CFT says we can ensure the cut has the inequality property
2777db96d56Sopenharmony_cifor lengths 1 through p-1, and jumping by p would line up the
2787db96d56Sopenharmony_cimatching characters and mismatched character one period earlier.
2797db96d56Sopenharmony_ciInductively, this proves that we can skip by the number of characters
2807db96d56Sopenharmony_cimatched in the right half, plus 1, just as in the original algorithm.
2817db96d56Sopenharmony_ci
2827db96d56Sopenharmony_ciTo make it explicit, the memory is set whenever the entire right part
2837db96d56Sopenharmony_ciis matched and is then used as a starting point in the next alignment.
2847db96d56Sopenharmony_ciIn such a case, the alignment jumps forward one period, and the right
2857db96d56Sopenharmony_cihalf matches all except possibly the last period. Additionally,
2867db96d56Sopenharmony_ciif we cut so that the left part has a length strictly less than the
2877db96d56Sopenharmony_ciperiod (we always can!), then we can know that the left part already
2887db96d56Sopenharmony_cimatches. The memory is reset to 0 whenever there is a mismatch in the
2897db96d56Sopenharmony_ciright part.
2907db96d56Sopenharmony_ci
2917db96d56Sopenharmony_ciTo prove linearity for the periodic case, note that if a right-part
2927db96d56Sopenharmony_cicharacter mismatches, then we advance forward 1 unit per comparison.
2937db96d56Sopenharmony_ciOn the other hand, if the entire right part matches, then the skipping
2947db96d56Sopenharmony_ciforward by one period "defers" some of the comparisons to the next
2957db96d56Sopenharmony_cialignment, where they will then be spent at the usual rate of
2967db96d56Sopenharmony_cione comparison per step forward. Even if left-half comparisons
2977db96d56Sopenharmony_ciare always "wasted", they constitute less than half of all
2987db96d56Sopenharmony_cicomparisons, so the average rate is certainly at least 1 move forward
2997db96d56Sopenharmony_ciper 2 comparisons.
3007db96d56Sopenharmony_ci
3017db96d56Sopenharmony_ci
3027db96d56Sopenharmony_ci-------- When to choose the periodic algorithm ---------
3037db96d56Sopenharmony_ci
3047db96d56Sopenharmony_ciThe periodic algorithm is always valid but has an overhead of one
3057db96d56Sopenharmony_cimore "memory" register and some memory computation steps, so the
3067db96d56Sopenharmony_cihere-described-first non-periodic/long-period algorithm -- skipping by
3077db96d56Sopenharmony_cimax(len(left_part), len(right_part)) + 1 rather than the period --
3087db96d56Sopenharmony_cishould be preferred when possible.
3097db96d56Sopenharmony_ci
3107db96d56Sopenharmony_ciInterestingly, the long-period algorithm does not require an exact
3117db96d56Sopenharmony_cicomputation of the period; it works even with some long-period, but
3127db96d56Sopenharmony_ciundeniably "periodic" needles:
3137db96d56Sopenharmony_ci
3147db96d56Sopenharmony_ci    Cut: AbcdefAbc == Abcde + fAbc
3157db96d56Sopenharmony_ci
3167db96d56Sopenharmony_ciThis cut gives these inequalities:
3177db96d56Sopenharmony_ci
3187db96d56Sopenharmony_ci                 e != f
3197db96d56Sopenharmony_ci                de != fA
3207db96d56Sopenharmony_ci               cde != fAb
3217db96d56Sopenharmony_ci              bcde != fAbc
3227db96d56Sopenharmony_ci             Abcde != fAbc?
3237db96d56Sopenharmony_ci    The first failure is a period long, per the CFT:
3247db96d56Sopenharmony_ci            ?Abcde == fAbc??
3257db96d56Sopenharmony_ci
3267db96d56Sopenharmony_ciA sufficient condition for using the long-period algorithm is having
3277db96d56Sopenharmony_cithe period of the needle be greater than
3287db96d56Sopenharmony_cimax(len(left_part), len(right_part)). This way, after choosing a good
3297db96d56Sopenharmony_cisplit, we get all of the max(len(left_part), len(right_part))
3307db96d56Sopenharmony_ciinequalities around the cut that were required in the long-period
3317db96d56Sopenharmony_civersion of the algorithm.
3327db96d56Sopenharmony_ci
3337db96d56Sopenharmony_ciWith all of this in mind, here's how we choose:
3347db96d56Sopenharmony_ci
3357db96d56Sopenharmony_ci    (1) Choose a "critical factorization" of the needle -- a cut
3367db96d56Sopenharmony_ci        where we have period minus 1 inequalities in a row.
3377db96d56Sopenharmony_ci        More specifically, choose a cut so that the left_part
3387db96d56Sopenharmony_ci        is less than one period long.
3397db96d56Sopenharmony_ci    (2) Determine the period P_r of the right_part.
3407db96d56Sopenharmony_ci    (3) Check if the left part is just an extension of the pattern of
3417db96d56Sopenharmony_ci        the right part, so that the whole needle has period P_r.
3427db96d56Sopenharmony_ci        Explicitly, check if
3437db96d56Sopenharmony_ci            needle[0:cut] == needle[0+P_r:cut+P_r]
3447db96d56Sopenharmony_ci        If so, we use the periodic algorithm. If not equal, we use the
3457db96d56Sopenharmony_ci        long-period algorithm.
3467db96d56Sopenharmony_ci
3477db96d56Sopenharmony_ciNote that if equality holds in (3), then the period of the whole
3487db96d56Sopenharmony_cistring is P_r. On the other hand, suppose equality does not hold.
3497db96d56Sopenharmony_ciThe period of the needle is then strictly greater than P_r. Here's
3507db96d56Sopenharmony_cia general fact:
3517db96d56Sopenharmony_ci
3527db96d56Sopenharmony_ci    If p is a substring of s and p has period r, then the period
3537db96d56Sopenharmony_ci    of s is either equal to r or greater than len(p).
3547db96d56Sopenharmony_ci
3557db96d56Sopenharmony_ciWe know that needle_period != P_r,
3567db96d56Sopenharmony_ciand therefore needle_period > len(right_part).
3577db96d56Sopenharmony_ciAdditionally, we'll choose the cut (see below)
3587db96d56Sopenharmony_ciso that len(left_part) < needle_period.
3597db96d56Sopenharmony_ci
3607db96d56Sopenharmony_ciThus, in the case where equality does not hold, we have that
3617db96d56Sopenharmony_cineedle_period >= max(len(left_part), len(right_part)) + 1,
3627db96d56Sopenharmony_ciso the long-period algorithm works, but otherwise, we know the period
3637db96d56Sopenharmony_ciof the needle.
3647db96d56Sopenharmony_ci
3657db96d56Sopenharmony_ciNote that this decision process doesn't always require an exact
3667db96d56Sopenharmony_cicomputation of the period -- we can get away with only computing P_r!
3677db96d56Sopenharmony_ci
3687db96d56Sopenharmony_ci
3697db96d56Sopenharmony_ci-------- Computing the cut --------
3707db96d56Sopenharmony_ci
3717db96d56Sopenharmony_ciOur remaining tasks are now to compute a cut of the needle with as
3727db96d56Sopenharmony_cimany inequalities as possible, ensuring that cut < needle_period.
3737db96d56Sopenharmony_ciMeanwhile, we must also compute the period P_r of the right_part.
3747db96d56Sopenharmony_ci
3757db96d56Sopenharmony_ciThe computation is relatively simple, essentially doing this:
3767db96d56Sopenharmony_ci
3777db96d56Sopenharmony_ci    suffix1 = max(needle[i:] for i in range(len(needle)))
3787db96d56Sopenharmony_ci    suffix2 = ... # the same as above, but invert the alphabet
3797db96d56Sopenharmony_ci    cut1 = len(needle) - len(suffix1)
3807db96d56Sopenharmony_ci    cut2 = len(needle) - len(suffix2)
3817db96d56Sopenharmony_ci    cut = max(cut1, cut2) # the later cut
3827db96d56Sopenharmony_ci
3837db96d56Sopenharmony_ciFor cut2, "invert the alphabet" is different than saying min(...),
3847db96d56Sopenharmony_cisince in lexicographic order, we still put "py" < "python", even
3857db96d56Sopenharmony_ciif the alphabet is inverted. Computing these, along with the method
3867db96d56Sopenharmony_ciof computing the period of the right half, is easiest to read directly
3877db96d56Sopenharmony_cifrom the source code in fastsearch.h, in which these are computed
3887db96d56Sopenharmony_ciin linear time.
3897db96d56Sopenharmony_ci
3907db96d56Sopenharmony_ciCrochemore & Perrin's Theorem 3.1 give that "cut" above is a
3917db96d56Sopenharmony_cicritical factorization less than the period, but a very brief sketch
3927db96d56Sopenharmony_ciof their proof goes something like this (this is far from complete):
3937db96d56Sopenharmony_ci
3947db96d56Sopenharmony_ci    * If this cut splits the needle as some
3957db96d56Sopenharmony_ci      needle == (a + w) + (w + b), meaning there's a bad equality
3967db96d56Sopenharmony_ci      w == w, it's impossible for w + b to be bigger than both
3977db96d56Sopenharmony_ci      b and w + w + b, so this can't happen. We thus have all of
3987db96d56Sopenharmony_ci      the ineuqalities with no question marks.
3997db96d56Sopenharmony_ci    * By maximality, the right part is not a substring of the left
4007db96d56Sopenharmony_ci      part. Thus, we have all of the inequalities involving no
4017db96d56Sopenharmony_ci      left-side question marks.
4027db96d56Sopenharmony_ci    * If you have all of the inequalities without right-side question
4037db96d56Sopenharmony_ci      marks, we have a critical factorization.
4047db96d56Sopenharmony_ci    * If one such inequality fails, then there's a smaller period,
4057db96d56Sopenharmony_ci      but the factorization is nonetheless critical. Here's where
4067db96d56Sopenharmony_ci      you need the redundancy coming from computing both cuts and
4077db96d56Sopenharmony_ci      choosing the later one.
4087db96d56Sopenharmony_ci
4097db96d56Sopenharmony_ci
4107db96d56Sopenharmony_ci-------- Some more Bells and Whistles --------
4117db96d56Sopenharmony_ci
4127db96d56Sopenharmony_ciBeyond Crochemore & Perrin's original algorithm, we can use a couple
4137db96d56Sopenharmony_cimore tricks for speed in fastsearch.h:
4147db96d56Sopenharmony_ci
4157db96d56Sopenharmony_ci    1. Even though C&P has a best-case O(n/m) time, this doesn't occur
4167db96d56Sopenharmony_ci       very often, so we add a Boyer-Moore bad character table to
4177db96d56Sopenharmony_ci       achieve sublinear time in more cases.
4187db96d56Sopenharmony_ci
4197db96d56Sopenharmony_ci    2. The prework of computing the cut/period is expensive per
4207db96d56Sopenharmony_ci       needle character, so we shouldn't do it if it won't pay off.
4217db96d56Sopenharmony_ci       For this reason, if the needle and haystack are long enough,
4227db96d56Sopenharmony_ci       only automatically start with two-way if the needle's length
4237db96d56Sopenharmony_ci       is a small percentage of the length of the haystack.
4247db96d56Sopenharmony_ci
4257db96d56Sopenharmony_ci    3. In cases where the needle and haystack are large but the needle
4267db96d56Sopenharmony_ci       makes up a significant percentage of the length of the
4277db96d56Sopenharmony_ci       haystack, don't pay the expensive two-way preprocessing cost
4287db96d56Sopenharmony_ci       if you don't need to. Instead, keep track of how many
4297db96d56Sopenharmony_ci       character comparisons are equal, and if that exceeds
4307db96d56Sopenharmony_ci       O(len(needle)), then pay that cost, since the simpler algorithm
4317db96d56Sopenharmony_ci       isn't doing very well.
432