These data sets are specifically crafted to try and defeat heuristic
optimizations in various substring search implementations. The point of these
is to make the costs of those heuristics clearer. In particular, the main idea
behind heuristics is to sell out some rare or edge cases in favor of making
some common cases *a lot* faster (potentially by orders of magnitude). The key
to this is to make sure that those edge cases are impacted at tolerable levels.

Below is a description of each.

* `repeated-rare-*`: This is meant to be used with the needle `abczdef`. This
  input defeats a heuristic in the old bstr and regex substring implementations
  that looked for a rare byte (in this case, `z`) to run memchr on before
  looking for an actual match. This particular input causes that heuristic to
  stop on every byte in the input. In regex's case in particular, this causes
  `O(mn)` time complexity. (In the case of `bstr`, it does a little better by
  stopping this heuristic after a number of tries once it becomes clear that it
  is ineffective.)
* `defeat-simple-vector`: The corpus consists of `qaz` repeated over and over
  again. The intended needle is `qbz`. This is meant to be difficult for the
  "generic SIMD" algorithm[1] to handle. Namely, it will repeatedly find a
  candidate match via the `q` and `z` bytes in the needle, but the overall
  match will fail at the `memcmp` phase. Nevertheless, optimized versions of
  [1] still do reasonably well on this benchmark because the `memcmp` can be
  specialized to a single `u32` unaligned load and compare.
* `defeat-simple-vector-freq`: This is similarish to `defeat-simple-vector`,
  except it also attempts to defeat heuristic frequency analysis. The corpus
  consists of `qjaz` repeated over and over again, with the intended needle
  being `qja{49}z`. Heuristic frequency analysis might try either the `q` or
  the `j`, in addition to `z`.
  Given the nature of the corpus, this will result in a lot of false positive
  candidates, thus leading to an ineffective prefilter.
* `defeat-simple-vector-repeated`: This combines the "repeated-rare" and
  "defeat-simple-vector" inputs. The corpus consists of `z` entirely, with only
  the second-to-last byte being changed to `a`. The intended needle is
  `z{135}az`. The key here is that in [1], a candidate match will be found at
  every position in the haystack. And since the needle is very large, this will
  result in a full `memcmp` call out. [1] effectively drowns in `memcmp` being
  called at every position in the haystack. The algorithm in this crate does
  a bit better by noticing that the prefilter is ineffective and falling back
  to standard Two-Way.
* `md5-huge`: This file contains one md5 hash per line for each word in the
  `../sliceslice/words.txt` corpus. The intent of this benchmark is to defeat
  frequency heuristics by using a corpus comprised of random data. That is,
  no one byte should be significantly more frequent than any other.
* `random-huge`: Similar to `md5-huge`, but with longer lines and more
  uniformly random data. Generated via
  `dd if=/dev/urandom bs=32 count=10000 | xxd -ps -c32`.
  This was derived from a real world benchmark reported to ripgrep[2].
  In particular, it originally motivated the addition of Boyer-Moore to
  the regex crate, but now this case is handled just fine by the memmem
  implementation in this crate.

[1]: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
[2]: https://github.com/BurntSushi/ripgrep/issues/617
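To make the rare-byte heuristic behind `repeated-rare-*` concrete, here is a
minimal Python sketch. This is an illustrative model only, not code from bstr,
regex, or this crate; `rare_byte_find` and its parameters are invented for the
example. It shows how a haystack made almost entirely of the "rare" byte forces
the prefilter to produce a candidate at nearly every position:

```python
# Hypothetical sketch of a rare-byte prefilter; not an actual implementation.
# The searcher scans for the needle's supposedly-rare byte with a fast
# memchr-style routine, then verifies a full match at each candidate.
def rare_byte_find(haystack: bytes, needle: bytes, rare_offset: int):
    """Return (match offset or None, number of candidates inspected)."""
    rare = needle[rare_offset]
    candidates = 0
    i = 0
    while True:
        j = haystack.find(rare, i)  # stand-in for memchr
        if j < 0:
            return None, candidates
        candidates += 1
        start = j - rare_offset
        if start >= 0 and haystack[start:start + len(needle)] == needle:
            return start, candidates
        i = j + 1

# A `repeated-rare-*`-style input: the haystack is almost entirely the
# "rare" byte `z`, so the prefilter stops at every `z` it sees.
haystack = b"z" * 10_000 + b"abczdef"
pos, candidates = rare_byte_find(haystack, b"abczdef", 3)
print(pos, candidates)  # 10000 10001: roughly one candidate per byte
```

The candidate count tracking the haystack length is exactly the `O(mn)` blowup
described above, and it is why bstr's version gives up on the heuristic after a
fixed number of unproductive tries.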