These data sets are specifically crafted to try to defeat heuristic
optimizations in various substring search implementations. The point of these
is to make the costs of those heuristics clearer. In particular, the main idea
behind heuristics is to sell out some rare or edge cases in favor of making
some common cases *a lot* faster (potentially by orders of magnitude). The key
to this is to make sure that those edge cases are impacted at tolerable levels.

Below is a description of each.

* `repeated-rare-*`: This is meant to be used with the needle `abczdef`. This
  input defeats a heuristic in the old bstr and regex substring implementations
  that looked for a rare byte (in this case, `z`) to run memchr on before
  looking for an actual match. This particular input causes that heuristic to
  stop on every byte in the input. In regex's case in particular, this causes
  `O(mn)` time complexity. (In the case of `bstr`, it does a little better by
  stopping this heuristic after a number of tries once it becomes clear that it
  is ineffective.)
* `defeat-simple-vector`: The corpus consists of `qaz` repeated over and over
  again. The intended needle is `qbz`. This is meant to be difficult for the
  "generic SIMD" algorithm[1] to handle. Namely, it will repeatedly find a
  candidate match via the `q` and `z` bytes in the needle, but the overall
  match will fail at the `memcmp` phase. Nevertheless, optimized versions of
  [1] still do reasonably well on this benchmark because the `memcmp` can be
  specialized to a single `u32` unaligned load and compare.
* `defeat-simple-vector-freq`: This is similar to `defeat-simple-vector`,
  except it also attempts to defeat heuristic frequency analysis. The corpus
  consists of `qjaz` repeated over and over again, with the intended needle
  being `qja{49}z`. Heuristic frequency analysis might try either the `q` or
  the `j`, in addition to `z`. Given the nature of the corpus, this will result
  in a lot of false positive candidates, thus leading to an ineffective
  prefilter.
* `defeat-simple-vector-repeated`: This combines the "repeated-rare" and
  "defeat-simple-vector" inputs. The corpus consists entirely of `z`, with only
  the second to last byte changed to `a`. The intended needle is
  `z{135}az`. The key here is that in [1], a candidate match will be found at
  every position in the haystack. And since the needle is very large, each
  candidate results in a full `memcmp` call. [1] effectively drowns in `memcmp`
  being called at every position in the haystack. The algorithm in this crate
  does a bit better by noticing that the prefilter is ineffective and falling
  back to standard Two-Way.
* `md5-huge`: This file contains one md5 hash per line for each word in the
  `../sliceslice/words.txt` corpus. The intent of this benchmark is to defeat
  frequency heuristics by using a corpus comprised of random data. That is,
  no one byte should be significantly more frequent than any other.
* `random-huge`: Similar to `md5-huge`, but with longer lines and more
  principally random data. Generated via
  `dd if=/dev/urandom bs=32 count=10000 | xxd -ps -c32`.
  This was derived from a real world benchmark reported to ripgrep[2].
  In particular, it originally motivated the addition of Boyer-Moore to
  the regex crate, but now this case is handled just fine by the memmem
  implementation in this crate.
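The `defeat-simple-vector*` inputs all target the candidate/verify structure
of [1]. A rough scalar sketch of that structure (not the vectorized
implementation, and `first_last_find` is a name made up for this example)
shows why `qaz` repeated with needle `qbz` is adversarial: every three-byte
period lines up the needle's first and last bytes, so every period yields a
candidate, and every candidate fails the `memcmp`-style verification.

```rust
/// Report the first match of `needle` in `haystack` along with how many
/// candidate positions were verified. A candidate is any position where the
/// needle's first and last bytes both line up (vectorized in [1]; a scalar
/// stand-in here).
fn first_last_find(haystack: &[u8], needle: &[u8]) -> (Option<usize>, usize) {
    assert!(!needle.is_empty());
    let mut candidates = 0;
    if haystack.len() < needle.len() {
        return (None, candidates);
    }
    let (first, last) = (needle[0], needle[needle.len() - 1]);
    for start in 0..=haystack.len() - needle.len() {
        let end = start + needle.len() - 1;
        if haystack[start] == first && haystack[end] == last {
            candidates += 1;
            // `memcmp`-style verification of the whole needle.
            if &haystack[start..=end] == needle {
                return (Some(start), candidates);
            }
        }
    }
    (None, candidates)
}

fn main() {
    // The `defeat-simple-vector` shape: `qaz` repeated, needle `qbz`.
    let haystack = b"qaz".repeat(10_000);
    let (found, candidates) = first_last_find(&haystack, b"qbz");
    assert_eq!(found, None);
    // One failed candidate per period: verification, not the candidate
    // scan, dominates the running time.
    assert_eq!(candidates, 10_000);
}
```

The same structure explains `defeat-simple-vector-freq`: swapping which
needle bytes are scanned for (say `j` and `z` instead of `q` and `z`) still
produces one false candidate per period of the corpus.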
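The fallback behavior mentioned for `repeated-rare-*` and
`defeat-simple-vector-repeated` can be sketched as follows. This illustrates
the general idea only, not this crate's actual code: `adaptive_find` and its
32-candidate budget are invented for the example, and a naive scan stands in
for the real Two-Way fallback.

```rust
/// Search with a first/last-byte prefilter, but track how effective it is.
/// If the early candidates buy almost no skipping, abandon the prefilter and
/// fall back to an algorithm with guaranteed worst-case behavior (Two-Way in
/// this crate; a naive scan stands in for it here).
fn adaptive_find(haystack: &[u8], needle: &[u8]) -> (Option<usize>, bool) {
    assert!(!needle.is_empty() && needle.len() <= haystack.len());
    let (first, last) = (needle[0], needle[needle.len() - 1]);
    let max_start = haystack.len() - needle.len();
    let mut candidates = 0usize;
    let mut start = 0usize;
    while start <= max_start {
        if haystack[start] == first && haystack[start + needle.len() - 1] == last {
            candidates += 1;
            if &haystack[start..start + needle.len()] == needle {
                return (Some(start), false);
            }
            // Invented heuristic: after 32 candidates, require that we have
            // advanced at least two bytes per candidate on average.
            if candidates >= 32 && start < candidates * 2 {
                break;
            }
        }
        start += 1;
    }
    if start > max_start {
        return (None, false);
    }
    // Fallback phase: verify every remaining position directly.
    let found = (start..=max_start).find(|&s| &haystack[s..s + needle.len()] == needle);
    (found, true)
}

fn main() {
    // The `defeat-simple-vector-repeated` shape: all `z`, except the second
    // to last byte is `a`; the needle is `z{135}az`.
    let mut haystack = vec![b'z'; 10_000];
    let n = haystack.len();
    haystack[n - 2] = b'a';
    let mut needle = vec![b'z'; 135];
    needle.extend_from_slice(b"az");
    let (found, fell_back) = adaptive_find(&haystack, &needle);
    assert_eq!(found, Some(n - needle.len()));
    assert!(fell_back, "the prefilter should have been abandoned");
}
```

On this input every haystack position is a prefilter candidate, so the
effectiveness check trips almost immediately and the match near the end of
the haystack is found by the fallback instead of by drowning in `memcmp`.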
[1]: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
[2]: https://github.com/BurntSushi/ripgrep/issues/617