These data sets are specifically crafted to try to defeat heuristic
optimizations in various substring search implementations. The point of these
is to make the costs of those heuristics clearer. In particular, the main idea
behind heuristics is to sell out some rare or edge cases in favor of making
some common cases *a lot* faster (potentially by orders of magnitude). The key
to making this work is to ensure that the impact on those edge cases stays at
a tolerable level.
7
Below is a description of each.
9
* `repeated-rare-*`: This is meant to be used with the needle `abczdef`. This
  input defeats a heuristic in the old bstr and regex substring implementations
  that looked for a rare byte (in this case, `z`) to run memchr on before
  looking for an actual match. This particular input causes that heuristic to
  stop on every byte in the input. In regex's case in particular, this causes
  `O(mn)` time complexity. (In the case of `bstr`, it does a little better by
  stopping this heuristic after a number of tries once it becomes clear that
  it is ineffective.) A sketch of this kind of rare-byte prefilter appears
  after this list.
* `defeat-simple-vector`: The corpus consists of `qaz` repeated over and over
  again. The intended needle is `qbz`. This is meant to be difficult for the
  "generic SIMD" algorithm[1] to handle. Namely, it will repeatedly find a
  candidate match via the `q` and `z` bytes in the needle, but the overall
  match will fail at the `memcmp` phase. Nevertheless, optimized versions of
  [1] still do reasonably well on this benchmark because the `memcmp` can be
  specialized to a single `u32` unaligned load and compare. A scalar sketch
  of this candidate-then-verify structure also appears after this list.
* `defeat-simple-vector-freq`: This is similar to `defeat-simple-vector`,
  except it also attempts to defeat heuristic frequency analysis. The corpus
  consists of `qjaz` repeated over and over again, with the intended needle
  being `qja{49}z`. Heuristic frequency analysis might try either the `q` or
  the `j`, in addition to `z`. Given the nature of the corpus, this will
  result in a lot of false positive candidates, thus leading to an
  ineffective prefilter.
* `defeat-simple-vector-repeated`: This combines the "repeated-rare" and
  "defeat-simple-vector" inputs. The corpus consists entirely of `z`, with
  only the second-to-last byte changed to `a`. The intended needle is
  `z{135}az`. The key here is that in [1], a candidate match will be found at
  every position in the haystack. And since the needle is very large, each
  candidate results in a full `memcmp` call. [1] effectively drowns in
  `memcmp` being called at every position in the haystack. The algorithm in
  this crate does a bit better by noticing that the prefilter is ineffective
  and falling back to standard Two-Way.
* `md5-huge`: This file contains one md5 hash per line for each word in the
  `../sliceslice/words.txt` corpus. The intent of this benchmark is to defeat
  frequency heuristics by using a corpus composed of random data. That is, no
  one byte should be significantly more frequent than any other.
* `random-huge`: Similar to `md5-huge`, but with longer lines and more
  principally random data. Generated via
  `dd if=/dev/urandom bs=32 count=10000 | xxd -ps -c32`.
  This was derived from a real-world benchmark reported to ripgrep[2]. In
  particular, it originally motivated the addition of Boyer-Moore to the
  regex crate, but now this case is handled just fine by the memmem
  implementation in this crate.
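
The following is a minimal sketch, in Rust, of the rare-byte prefilter idea
that `repeated-rare-*` is designed to defeat. It is not the actual bstr or
regex code; the function name and the hard-coded choice of `z` at offset 3
(matching the needle `abczdef`) are illustrative assumptions, with
`memchr::memchr` used as the scanning primitive.

```rust
/// Sketch only: a rare-byte prefilter specialized to the needle `abczdef`,
/// where `z` (offset 3) is assumed to be the "rare" byte.
fn find_with_rare_byte_prefilter(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let (rare_byte, rare_offset) = (b'z', 3);
    // The rare byte cannot occur before `rare_offset` in any valid match.
    let mut pos = rare_offset;
    while pos < haystack.len() {
        // Prefilter: jump to the next occurrence of the rare byte.
        let hit = pos + memchr::memchr(rare_byte, &haystack[pos..])?;
        // A match containing that occurrence would have to start here.
        let start = hit - rare_offset;
        if start + needle.len() <= haystack.len()
            && &haystack[start..start + needle.len()] == needle
        {
            return Some(start);
        }
        // On `repeated-rare-*`, nearly every byte is `z`, so the prefilter
        // yields a candidate at essentially every position and the
        // verification above runs each time, mirroring the `O(mn)` behavior
        // described above.
        pos = hit + 1;
    }
    None
}
```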
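
In the same spirit, here is a scalar sketch of the candidate-then-verify
structure that the `defeat-simple-vector*` inputs target. The real algorithm
in [1] tests many positions per iteration (e.g., 16 or 32) with vector
compares; this loop checks them one at a time, but the shape of the work, and
of the failure mode, is the same. The function name is illustrative.

```rust
/// Sketch only: candidates are positions where the first and last needle
/// bytes both match; each candidate is confirmed with a full comparison
/// (the `memcmp` phase).
fn find_first_last_then_verify(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() || haystack.len() < needle.len() {
        return None;
    }
    let (first, last) = (needle[0], needle[needle.len() - 1]);
    for start in 0..=haystack.len() - needle.len() {
        // Candidate test: cheap check of the first and last needle bytes.
        if haystack[start] == first && haystack[start + needle.len() - 1] == last {
            // Verification. On `defeat-simple-vector` (`qaz` repeated,
            // needle `qbz`), every `q` produces a candidate and every
            // verification fails, so this runs over and over again.
            if &haystack[start..start + needle.len()] == needle {
                return Some(start);
            }
        }
    }
    None
}
```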
[1]: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
[2]: https://github.com/BurntSushi/ripgrep/issues/617