• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1This document explains Crochemore and Perrin's Two-Way string matching
2algorithm, in which a smaller string (the "pattern" or "needle")
3is searched for in a longer string (the "text" or "haystack"),
4determining whether the needle is a substring of the haystack, and if
5so, at what index(es). It is to be used by Python's string
6(and bytes-like) objects when calling `find`, `index`, `__contains__`,
7or implicitly in methods like `replace` or `partition`.
8
9This is essentially a re-telling of the paper
10
11    Crochemore M., Perrin D., 1991, Two-way string-matching,
12        Journal of the ACM 38(3):651-675.
13
14focused more on understanding and examples than on rigor. See also
15the code sample here:
16
17    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
18
19The algorithm runs in O(len(needle) + len(haystack)) time and with
20O(1) space. However, since there is a larger preprocessing cost than
21simpler algorithms, this Two-Way algorithm is to be used only when the
22needle and haystack lengths meet certain thresholds.
23
24
25These are the basic steps of the algorithm:
26
27    * "Very carefully" cut the needle in two.
28    * For each alignment attempted:
29        1. match the right part
30            * On failure, jump by the amount matched + 1
31        2. then match the left part.
32            * On failure jump by max(len(left), len(right)) + 1
33    * If the needle is periodic, don't re-do comparisons; maintain
34      a "memory" of how many characters you already know match.
35
36
37-------- Matching the right part --------
38
39We first scan the right part of the needle to check if it matches the
40the aligned characters in the haystack. We scan left-to-right,
41and if a mismatch occurs, we jump ahead by the amount matched plus 1.
42
43Example:
44
45       text:    ........EFGX...................
46    pattern:    ....abcdEFGH....
47        cut:        <<<<>>>>
48
49Matched 3, so jump ahead by 4:
50
51       text:    ........EFGX...................
52    pattern:        ....abcdEFGH....
53        cut:            <<<<>>>>
54
55Why are we allowed to do this? Because we cut the needle very
56carefully, in such a way that if the cut is ...abcd + EFGH... then
57we have
58
59        d != E
60       cd != EF
61      bcd != EFG
62     abcd != EFGH
63          ... and so on.
64
65If this is true for every pair of equal-length substrings around the
66cut, then the following alignments do not work, so we can skip them:
67
68       text:    ........EFG....................
69    pattern:     ....abcdEFGH....
70                        ^   (Bad because d != E)
71       text:    ........EFG....................
72    pattern:      ....abcdEFGH....
73                        ^^   (Bad because cd != EF)
74       text:    ........EFG....................
75    pattern:       ....abcdEFGH....
76                        ^^^   (Bad because bcd != EFG)
77
78Skip 3 alignments => increment alignment by 4.
79
80
81-------- If len(left_part) < len(right_part) --------
82
83Above is the core idea, and it begins to suggest how the algorithm can
84be linear-time. There is one bit of subtlety involving what to do
85around the end of the needle: if the left half is shorter than the
86right, then we could run into something like this:
87
88       text:    .....EFG......
89    pattern:       cdEFGH
90
91The same argument holds that we can skip ahead by 4, so long as
92
93       d != E
94      cd != EF
95     ?cd != EFG
96    ??cd != EFGH
97         etc.
98
99The question marks represent "wildcards" that always match; they're
100outside the limits of the needle, so there's no way for them to
101invalidate a match. To ensure that the inequalities above are always
102true, we need them to be true for all possible '?' values. We thus
103need cd != FG and cd != GH, etc.
104
105
106-------- Matching the left part --------
107
108Once we have ensured the right part matches, we scan the left part
109(order doesn't matter, but traditionally right-to-left), and if we
110find a mismatch, we jump ahead by
111max(len(left_part), len(right_part)) + 1. That we can jump by
112at least len(right_part) + 1 we have already seen:
113
114       text: .....EFG.....
115    pattern:  abcdEFG
116    Matched 3, so jump by 4,
117    using the fact that d != E, cd != EF, and bcd != EFG.
118
119But we can also jump by at least len(left_part) + 1:
120
121       text: ....cdEF.....
122    pattern:   abcdEF
123    Jump by len('abcd') + 1 = 5.
124
125    Skip the alignments:
126       text: ....cdEF.....
127    pattern:    abcdEF
128       text: ....cdEF.....
129    pattern:     abcdEF
130       text: ....cdEF.....
131    pattern:      abcdEF
132       text: ....cdEF.....
133    pattern:       abcdEF
134
135This requires the following facts:
136       d != E
137      cd != EF
138     bcd != EF?
139    abcd != EF??
140         etc., for all values of ?s, as above.
141
142If we have both sets of inequalities, then we can indeed jump by
143max(len(left_part), len(right_part)) + 1. Under the assumption of such
144a nice splitting of the needle, we now have enough to prove linear
145time for the search: consider the forward-progress/comparisons ratio
146at each alignment position. If a mismatch occurs in the right part,
147the ratio is 1 position forward per comparison. On the other hand,
148if a mismatch occurs in the left half, we advance by more than
149len(needle)//2 positions for at most len(needle) comparisons,
150so this ratio is more than 1/2. This average "movement speed" is
151bounded below by the constant "1 position per 2 comparisons", so we
152have linear time.
153
154
155-------- The periodic case --------
156
157The sets of inequalities listed so far seem too good to be true in
158the general case. Indeed, they fail when a needle is periodic:
159there's no way to split 'AAbAAbAAbA' in two such that
160
161    (the stuff n characters to the left of the split)
162    cannot equal
163    (the stuff n characters to the right of the split)
164    for all n.
165
166This is because no matter how you cut it, you'll get
167s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
168needle in two so that n can be as big as possible. If we were to
169split it as
170
171    AAbA + AbAAbA
172
173then A == A at the split, so this is bad (we failed at length 1), but
174if we split it as
175
176    AA + bAAbAAbA
177
178we at least have A != b and AA != bA, and we fail at length 3
179since ?AA == bAA. We already knew that a cut to make length-3
180mismatch was impossible due to the period, but we now see that the
181bound is sharp; we can get length-1 and length-2 to mismatch.
182
183This is exactly the content of the *critical factorization theorem*:
184that no matter the period of the original needle, you can cut it in
185such a way that (with the appropriate question marks),
186needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.
187
188Even "non-periodic" strings are periodic with a period equal to
189their length, so for such needles, the CFT already guarantees that
190the algorithm described so far will work, since we can cut the needle
191so that the length-k chunks on either side of the cut mismatch for all
192k < len(needle). Looking closer at the algorithm, we only actually
193require that k go up to max(len(left_part), len(right_part)).
194So long as the period exceeds that, we're good.
195
196The more general shorter-period case is a bit harder. The essentials
197are the same, except we use the periodicity to our advantage by
198"remembering" periods that we've already compared. In our running
199example, say we're computing
200
201    "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".
202
203We cut as AA + bAAbAAbA, and then the algorithm runs as follows:
204
205    First alignment:
206    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
207    AAbAAbAAbA
208      ^^X
209    - Mismatch at third position, so jump by 3.
210    - This requires that A!=b and AA != bA.
211
212    Second alignment:
213    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
214       AAbAAbAAbA
215         ^^^^^^^^
216        X
217    - Matched entire right part
218    - Mismatch at left part.
219    - Jump forward a period, remembering the existing comparisons
220
221    Third alignment:
222    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
223          AAbAAbAAbA
224          mmmmmmm^^X
225    - There's "memory": a bunch of characters were already matched.
226    - Two more characters match beyond that.
227    - The 8th character of the right part mismatched, so jump by 8
228    - The above rule is more complicated than usual: we don't have
229      the right inequalities for lengths 1 through 7, but we do have
230      shifted copies of the length-1 and length-2 inequalities,
231      along with knowledge of the mismatch. We can skip all of these
232      alignments at once:
233
234        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
235               AAbAAbAAbA
236                ~                   A != b at the cut
237        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
238                AAbAAbAAbA
239                ~~                  AA != bA at the cut
240        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
241                 AAbAAbAAbA
242                 ^^^^X              7-3=4 match, and the 5th misses.
243        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
244                  AAbAAbAAbA
245                   ~                A != b at the cut
246        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
247                   AAbAAbAAbA
248                   ~~               AA != bA at the cut
249        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
250                    AAbAAbAAbA
251                      ^X            7-3-3=1 match and the 2nd misses.
252        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
253                     AAbAAbAAbA
254                      ~             A != b at the cut
255
256    Fourth alignment:
257    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
258                 AAbAAbAAbA
259                   ^X
260    - Second character mismatches, so jump by 2.
261
262    Fifth alignment:
263    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
264                  AAbAAbAAbA
265                    ^^^^^^^^
266                   X
267    - Right half matches, so use memory and skip ahead by period=3
268
269    Sixth alignment:
270    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
271                     AAbAAbAAbA
272                     mmmmmmmm^^
273    - Right part matches, left part is remembered, found a match!
274
275The one tricky skip by 8 here generalizes: if we have a period of p,
276then the CFT says we can ensure the cut has the inequality property
277for lengths 1 through p-1, and jumping by p would line up the
278matching characters and mismatched character one period earlier.
279Inductively, this proves that we can skip by the number of characters
280matched in the right half, plus 1, just as in the original algorithm.
281
282To make it explicit, the memory is set whenever the entire right part
283is matched and is then used as a starting point in the next alignment.
284In such a case, the alignment jumps forward one period, and the right
285half matches all except possibly the last period. Additionally,
286if we cut so that the left part has a length strictly less than the
287period (we always can!), then we can know that the left part already
288matches. The memory is reset to 0 whenever there is a mismatch in the
289right part.
290
291To prove linearity for the periodic case, note that if a right-part
292character mismatches, then we advance forward 1 unit per comparison.
293On the other hand, if the entire right part matches, then the skipping
294forward by one period "defers" some of the comparisons to the next
295alignment, where they will then be spent at the usual rate of
296one comparison per step forward. Even if left-half comparisons
297are always "wasted", they constitute less than half of all
298comparisons, so the average rate is certainly at least 1 move forward
299per 2 comparisons.
300
301
302-------- When to choose the periodic algorithm ---------
303
304The periodic algorithm is always valid but has an overhead of one
305more "memory" register and some memory computation steps, so the
306here-described-first non-periodic/long-period algorithm -- skipping by
307max(len(left_part), len(right_part)) + 1 rather than the period --
308should be preferred when possible.
309
310Interestingly, the long-period algorithm does not require an exact
311computation of the period; it works even with some long-period, but
312undeniably "periodic" needles:
313
314    Cut: AbcdefAbc == Abcde + fAbc
315
316This cut gives these inequalities:
317
318                 e != f
319                de != fA
320               cde != fAb
321              bcde != fAbc
322             Abcde != fAbc?
323    The first failure is a period long, per the CFT:
324            ?Abcde == fAbc??
325
326A sufficient condition for using the long-period algorithm is having
327the period of the needle be greater than
328max(len(left_part), len(right_part)). This way, after choosing a good
329split, we get all of the max(len(left_part), len(right_part))
330inequalities around the cut that were required in the long-period
331version of the algorithm.
332
333With all of this in mind, here's how we choose:
334
335    (1) Choose a "critical factorization" of the needle -- a cut
336        where we have period minus 1 inequalities in a row.
337        More specifically, choose a cut so that the left_part
338        is less than one period long.
339    (2) Determine the period P_r of the right_part.
340    (3) Check if the left part is just an extension of the pattern of
341        the right part, so that the whole needle has period P_r.
342        Explicitly, check if
343            needle[0:cut] == needle[0+P_r:cut+P_r]
344        If so, we use the periodic algorithm. If not equal, we use the
345        long-period algorithm.
346
347Note that if equality holds in (3), then the period of the whole
348string is P_r. On the other hand, suppose equality does not hold.
349The period of the needle is then strictly greater than P_r. Here's
350a general fact:
351
352    If p is a substring of s and p has period r, then the period
353    of s is either equal to r or greater than len(p).
354
355We know that needle_period != P_r,
356and therefore needle_period > len(right_part).
357Additionally, we'll choose the cut (see below)
358so that len(left_part) < needle_period.
359
360Thus, in the case where equality does not hold, we have that
361needle_period >= max(len(left_part), len(right_part)) + 1,
362so the long-period algorithm works, but otherwise, we know the period
363of the needle.
364
365Note that this decision process doesn't always require an exact
366computation of the period -- we can get away with only computing P_r!
367
368
369-------- Computing the cut --------
370
371Our remaining tasks are now to compute a cut of the needle with as
372many inequalities as possible, ensuring that cut < needle_period.
373Meanwhile, we must also compute the period P_r of the right_part.
374
375The computation is relatively simple, essentially doing this:
376
377    suffix1 = max(needle[i:] for i in range(len(needle)))
378    suffix2 = ... # the same as above, but invert the alphabet
379    cut1 = len(needle) - len(suffix1)
380    cut2 = len(needle) - len(suffix2)
381    cut = max(cut1, cut2) # the later cut
382
383For cut2, "invert the alphabet" is different than saying min(...),
384since in lexicographic order, we still put "py" < "python", even
385if the alphabet is inverted. Computing these, along with the method
386of computing the period of the right half, is easiest to read directly
387from the source code in fastsearch.h, in which these are computed
388in linear time.
389
390Crochemore & Perrin's Theorem 3.1 give that "cut" above is a
391critical factorization less than the period, but a very brief sketch
392of their proof goes something like this (this is far from complete):
393
394    * If this cut splits the needle as some
395      needle == (a + w) + (w + b), meaning there's a bad equality
396      w == w, it's impossible for w + b to be bigger than both
397      b and w + w + b, so this can't happen. We thus have all of
398      the ineuqalities with no question marks.
399    * By maximality, the right part is not a substring of the left
400      part. Thus, we have all of the inequalities involving no
401      left-side question marks.
402    * If you have all of the inequalities without right-side question
403      marks, we have a critical factorization.
404    * If one such inequality fails, then there's a smaller period,
405      but the factorization is nonetheless critical. Here's where
406      you need the redundancy coming from computing both cuts and
407      choosing the later one.
408
409
410-------- Some more Bells and Whistles --------
411
412Beyond Crochemore & Perrin's original algorithm, we can use a couple
413more tricks for speed in fastsearch.h:
414
415    1. Even though C&P has a best-case O(n/m) time, this doesn't occur
416       very often, so we add a Boyer-Moore bad character table to
417       achieve sublinear time in more cases.
418
419    2. The prework of computing the cut/period is expensive per
420       needle character, so we shouldn't do it if it won't pay off.
421       For this reason, if the needle and haystack are long enough,
422       only automatically start with two-way if the needle's length
423       is a small percentage of the length of the haystack.
424
425    3. In cases where the needle and haystack are large but the needle
426       makes up a significant percentage of the length of the
427       haystack, don't pay the expensive two-way preprocessing cost
428       if you don't need to. Instead, keep track of how many
429       character comparisons are equal, and if that exceeds
430       O(len(needle)), then pay that cost, since the simpler algorithm
431       isn't doing very well.
432