1This document explains Crochemore and Perrin's Two-Way string matching 2algorithm, in which a smaller string (the "pattern" or "needle") 3is searched for in a longer string (the "text" or "haystack"), 4determining whether the needle is a substring of the haystack, and if 5so, at what index(es). It is to be used by Python's string 6(and bytes-like) objects when calling `find`, `index`, `__contains__`, 7or implicitly in methods like `replace` or `partition`. 8 9This is essentially a re-telling of the paper 10 11 Crochemore M., Perrin D., 1991, Two-way string-matching, 12 Journal of the ACM 38(3):651-675. 13 14focused more on understanding and examples than on rigor. See also 15the code sample here: 16 17 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 18 19The algorithm runs in O(len(needle) + len(haystack)) time and with 20O(1) space. However, since there is a larger preprocessing cost than 21simpler algorithms, this Two-Way algorithm is to be used only when the 22needle and haystack lengths meet certain thresholds. 23 24 25These are the basic steps of the algorithm: 26 27 * "Very carefully" cut the needle in two. 28 * For each alignment attempted: 29 1. match the right part 30 * On failure, jump by the amount matched + 1 31 2. then match the left part. 32 * On failure jump by max(len(left), len(right)) + 1 33 * If the needle is periodic, don't re-do comparisons; maintain 34 a "memory" of how many characters you already know match. 35 36 37-------- Matching the right part -------- 38 39We first scan the right part of the needle to check if it matches the 40the aligned characters in the haystack. We scan left-to-right, 41and if a mismatch occurs, we jump ahead by the amount matched plus 1. 42 43Example: 44 45 text: ........EFGX................... 46 pattern: ....abcdEFGH.... 47 cut: <<<<>>>> 48 49Matched 3, so jump ahead by 4: 50 51 text: ........EFGX................... 52 pattern: ....abcdEFGH.... 53 cut: <<<<>>>> 54 55Why are we allowed to do this? Because we cut the needle very 56carefully, in such a way that if the cut is ...abcd + EFGH... then 57we have 58 59 d != E 60 cd != EF 61 bcd != EFG 62 abcd != EFGH 63 ... and so on. 64 65If this is true for every pair of equal-length substrings around the 66cut, then the following alignments do not work, so we can skip them: 67 68 text: ........EFG.................... 69 pattern: ....abcdEFGH.... 70 ^ (Bad because d != E) 71 text: ........EFG.................... 72 pattern: ....abcdEFGH.... 73 ^^ (Bad because cd != EF) 74 text: ........EFG.................... 75 pattern: ....abcdEFGH.... 76 ^^^ (Bad because bcd != EFG) 77 78Skip 3 alignments => increment alignment by 4. 79 80 81-------- If len(left_part) < len(right_part) -------- 82 83Above is the core idea, and it begins to suggest how the algorithm can 84be linear-time. There is one bit of subtlety involving what to do 85around the end of the needle: if the left half is shorter than the 86right, then we could run into something like this: 87 88 text: .....EFG...... 89 pattern: cdEFGH 90 91The same argument holds that we can skip ahead by 4, so long as 92 93 d != E 94 cd != EF 95 ?cd != EFG 96 ??cd != EFGH 97 etc. 98 99The question marks represent "wildcards" that always match; they're 100outside the limits of the needle, so there's no way for them to 101invalidate a match. To ensure that the inequalities above are always 102true, we need them to be true for all possible '?' values. We thus 103need cd != FG and cd != GH, etc. 104 105 106-------- Matching the left part -------- 107 108Once we have ensured the right part matches, we scan the left part 109(order doesn't matter, but traditionally right-to-left), and if we 110find a mismatch, we jump ahead by 111max(len(left_part), len(right_part)) + 1. That we can jump by 112at least len(right_part) + 1 we have already seen: 113 114 text: .....EFG..... 115 pattern: abcdEFG 116 Matched 3, so jump by 4, 117 using the fact that d != E, cd != EF, and bcd != EFG. 118 119But we can also jump by at least len(left_part) + 1: 120 121 text: ....cdEF..... 122 pattern: abcdEF 123 Jump by len('abcd') + 1 = 5. 124 125 Skip the alignments: 126 text: ....cdEF..... 127 pattern: abcdEF 128 text: ....cdEF..... 129 pattern: abcdEF 130 text: ....cdEF..... 131 pattern: abcdEF 132 text: ....cdEF..... 133 pattern: abcdEF 134 135This requires the following facts: 136 d != E 137 cd != EF 138 bcd != EF? 139 abcd != EF?? 140 etc., for all values of ?s, as above. 141 142If we have both sets of inequalities, then we can indeed jump by 143max(len(left_part), len(right_part)) + 1. Under the assumption of such 144a nice splitting of the needle, we now have enough to prove linear 145time for the search: consider the forward-progress/comparisons ratio 146at each alignment position. If a mismatch occurs in the right part, 147the ratio is 1 position forward per comparison. On the other hand, 148if a mismatch occurs in the left half, we advance by more than 149len(needle)//2 positions for at most len(needle) comparisons, 150so this ratio is more than 1/2. This average "movement speed" is 151bounded below by the constant "1 position per 2 comparisons", so we 152have linear time. 153 154 155-------- The periodic case -------- 156 157The sets of inequalities listed so far seem too good to be true in 158the general case. Indeed, they fail when a needle is periodic: 159there's no way to split 'AAbAAbAAbA' in two such that 160 161 (the stuff n characters to the left of the split) 162 cannot equal 163 (the stuff n characters to the right of the split) 164 for all n. 165 166This is because no matter how you cut it, you'll get 167s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the 168needle in two so that n can be as big as possible. If we were to 169split it as 170 171 AAbA + AbAAbA 172 173then A == A at the split, so this is bad (we failed at length 1), but 174if we split it as 175 176 AA + bAAbAAbA 177 178we at least have A != b and AA != bA, and we fail at length 3 179since ?AA == bAA. We already knew that a cut to make length-3 180mismatch was impossible due to the period, but we now see that the 181bound is sharp; we can get length-1 and length-2 to mismatch. 182 183This is exactly the content of the *critical factorization theorem*: 184that no matter the period of the original needle, you can cut it in 185such a way that (with the appropriate question marks), 186needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period. 187 188Even "non-periodic" strings are periodic with a period equal to 189their length, so for such needles, the CFT already guarantees that 190the algorithm described so far will work, since we can cut the needle 191so that the length-k chunks on either side of the cut mismatch for all 192k < len(needle). Looking closer at the algorithm, we only actually 193require that k go up to max(len(left_part), len(right_part)). 194So long as the period exceeds that, we're good. 195 196The more general shorter-period case is a bit harder. The essentials 197are the same, except we use the periodicity to our advantage by 198"remembering" periods that we've already compared. In our running 199example, say we're computing 200 201 "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA". 202 203We cut as AA + bAAbAAbA, and then the algorithm runs as follows: 204 205 First alignment: 206 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 207 AAbAAbAAbA 208 ^^X 209 - Mismatch at third position, so jump by 3. 210 - This requires that A!=b and AA != bA. 211 212 Second alignment: 213 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 214 AAbAAbAAbA 215 ^^^^^^^^ 216 X 217 - Matched entire right part 218 - Mismatch at left part. 219 - Jump forward a period, remembering the existing comparisons 220 221 Third alignment: 222 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 223 AAbAAbAAbA 224 mmmmmmm^^X 225 - There's "memory": a bunch of characters were already matched. 226 - Two more characters match beyond that. 227 - The 8th character of the right part mismatched, so jump by 8 228 - The above rule is more complicated than usual: we don't have 229 the right inequalities for lengths 1 through 7, but we do have 230 shifted copies of the length-1 and length-2 inequalities, 231 along with knowledge of the mismatch. We can skip all of these 232 alignments at once: 233 234 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 235 AAbAAbAAbA 236 ~ A != b at the cut 237 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 238 AAbAAbAAbA 239 ~~ AA != bA at the cut 240 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 241 AAbAAbAAbA 242 ^^^^X 7-3=4 match, and the 5th misses. 243 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 244 AAbAAbAAbA 245 ~ A != b at the cut 246 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 247 AAbAAbAAbA 248 ~~ AA != bA at the cut 249 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 250 AAbAAbAAbA 251 ^X 7-3-3=1 match and the 2nd misses. 252 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 253 AAbAAbAAbA 254 ~ A != b at the cut 255 256 Fourth alignment: 257 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 258 AAbAAbAAbA 259 ^X 260 - Second character mismatches, so jump by 2. 261 262 Fifth alignment: 263 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 264 AAbAAbAAbA 265 ^^^^^^^^ 266 X 267 - Right half matches, so use memory and skip ahead by period=3 268 269 Sixth alignment: 270 bbbAbbAAbAAbAAbbbAAbAAbAAbAA 271 AAbAAbAAbA 272 mmmmmmmm^^ 273 - Right part matches, left part is remembered, found a match! 274 275The one tricky skip by 8 here generalizes: if we have a period of p, 276then the CFT says we can ensure the cut has the inequality property 277for lengths 1 through p-1, and jumping by p would line up the 278matching characters and mismatched character one period earlier. 279Inductively, this proves that we can skip by the number of characters 280matched in the right half, plus 1, just as in the original algorithm. 281 282To make it explicit, the memory is set whenever the entire right part 283is matched and is then used as a starting point in the next alignment. 284In such a case, the alignment jumps forward one period, and the right 285half matches all except possibly the last period. Additionally, 286if we cut so that the left part has a length strictly less than the 287period (we always can!), then we can know that the left part already 288matches. The memory is reset to 0 whenever there is a mismatch in the 289right part. 290 291To prove linearity for the periodic case, note that if a right-part 292character mismatches, then we advance forward 1 unit per comparison. 293On the other hand, if the entire right part matches, then the skipping 294forward by one period "defers" some of the comparisons to the next 295alignment, where they will then be spent at the usual rate of 296one comparison per step forward. Even if left-half comparisons 297are always "wasted", they constitute less than half of all 298comparisons, so the average rate is certainly at least 1 move forward 299per 2 comparisons. 300 301 302-------- When to choose the periodic algorithm --------- 303 304The periodic algorithm is always valid but has an overhead of one 305more "memory" register and some memory computation steps, so the 306here-described-first non-periodic/long-period algorithm -- skipping by 307max(len(left_part), len(right_part)) + 1 rather than the period -- 308should be preferred when possible. 309 310Interestingly, the long-period algorithm does not require an exact 311computation of the period; it works even with some long-period, but 312undeniably "periodic" needles: 313 314 Cut: AbcdefAbc == Abcde + fAbc 315 316This cut gives these inequalities: 317 318 e != f 319 de != fA 320 cde != fAb 321 bcde != fAbc 322 Abcde != fAbc? 323 The first failure is a period long, per the CFT: 324 ?Abcde == fAbc?? 325 326A sufficient condition for using the long-period algorithm is having 327the period of the needle be greater than 328max(len(left_part), len(right_part)). This way, after choosing a good 329split, we get all of the max(len(left_part), len(right_part)) 330inequalities around the cut that were required in the long-period 331version of the algorithm. 332 333With all of this in mind, here's how we choose: 334 335 (1) Choose a "critical factorization" of the needle -- a cut 336 where we have period minus 1 inequalities in a row. 337 More specifically, choose a cut so that the left_part 338 is less than one period long. 339 (2) Determine the period P_r of the right_part. 340 (3) Check if the left part is just an extension of the pattern of 341 the right part, so that the whole needle has period P_r. 342 Explicitly, check if 343 needle[0:cut] == needle[0+P_r:cut+P_r] 344 If so, we use the periodic algorithm. If not equal, we use the 345 long-period algorithm. 346 347Note that if equality holds in (3), then the period of the whole 348string is P_r. On the other hand, suppose equality does not hold. 349The period of the needle is then strictly greater than P_r. Here's 350a general fact: 351 352 If p is a substring of s and p has period r, then the period 353 of s is either equal to r or greater than len(p). 354 355We know that needle_period != P_r, 356and therefore needle_period > len(right_part). 357Additionally, we'll choose the cut (see below) 358so that len(left_part) < needle_period. 359 360Thus, in the case where equality does not hold, we have that 361needle_period >= max(len(left_part), len(right_part)) + 1, 362so the long-period algorithm works, but otherwise, we know the period 363of the needle. 364 365Note that this decision process doesn't always require an exact 366computation of the period -- we can get away with only computing P_r! 367 368 369-------- Computing the cut -------- 370 371Our remaining tasks are now to compute a cut of the needle with as 372many inequalities as possible, ensuring that cut < needle_period. 373Meanwhile, we must also compute the period P_r of the right_part. 374 375The computation is relatively simple, essentially doing this: 376 377 suffix1 = max(needle[i:] for i in range(len(needle))) 378 suffix2 = ... # the same as above, but invert the alphabet 379 cut1 = len(needle) - len(suffix1) 380 cut2 = len(needle) - len(suffix2) 381 cut = max(cut1, cut2) # the later cut 382 383For cut2, "invert the alphabet" is different than saying min(...), 384since in lexicographic order, we still put "py" < "python", even 385if the alphabet is inverted. Computing these, along with the method 386of computing the period of the right half, is easiest to read directly 387from the source code in fastsearch.h, in which these are computed 388in linear time. 389 390Crochemore & Perrin's Theorem 3.1 give that "cut" above is a 391critical factorization less than the period, but a very brief sketch 392of their proof goes something like this (this is far from complete): 393 394 * If this cut splits the needle as some 395 needle == (a + w) + (w + b), meaning there's a bad equality 396 w == w, it's impossible for w + b to be bigger than both 397 b and w + w + b, so this can't happen. We thus have all of 398 the ineuqalities with no question marks. 399 * By maximality, the right part is not a substring of the left 400 part. Thus, we have all of the inequalities involving no 401 left-side question marks. 402 * If you have all of the inequalities without right-side question 403 marks, we have a critical factorization. 404 * If one such inequality fails, then there's a smaller period, 405 but the factorization is nonetheless critical. Here's where 406 you need the redundancy coming from computing both cuts and 407 choosing the later one. 408 409 410-------- Some more Bells and Whistles -------- 411 412Beyond Crochemore & Perrin's original algorithm, we can use a couple 413more tricks for speed in fastsearch.h: 414 415 1. Even though C&P has a best-case O(n/m) time, this doesn't occur 416 very often, so we add a Boyer-Moore bad character table to 417 achieve sublinear time in more cases. 418 419 2. The prework of computing the cut/period is expensive per 420 needle character, so we shouldn't do it if it won't pay off. 421 For this reason, if the needle and haystack are long enough, 422 only automatically start with two-way if the needle's length 423 is a small percentage of the length of the haystack. 424 425 3. In cases where the needle and haystack are large but the needle 426 makes up a significant percentage of the length of the 427 haystack, don't pay the expensive two-way preprocessing cost 428 if you don't need to. Instead, keep track of how many 429 character comparisons are equal, and if that exceeds 430 O(len(needle)), then pay that cost, since the simpler algorithm 431 isn't doing very well. 432