Your friendly guide to hacking and navigating the regex library.

This guide assumes familiarity with Rust and Cargo, and at least a perusal of
the user facing documentation for this crate.

If you're looking for background on the implementation in this library, then
you can do no better than Russ Cox's article series on implementing regular
expressions using finite automata: https://swtch.com/~rsc/regexp/


## Architecture overview

As you probably already know, this library executes regular expressions using
finite automata. In particular, a design goal is to make searching linear
with respect to both the size of the regular expression and the text being
searched. Meeting that design goal on its own is not so hard and can be done
with an implementation of the Pike VM (similar to Thompson's construction,
but with support for capturing groups), as described in:
https://swtch.com/~rsc/regexp/regexp2.html --- This library contains such an
implementation in src/pikevm.rs.

Making it fast is harder. One of the key problems with the Pike VM is that it
can be in more than one state at any point in time, and must shuffle capture
positions between them. The Pike VM also spends a lot of time following the
same epsilon transitions over and over again. We can employ one trick to
speed up the Pike VM: extract one or more literal prefixes from the regular
expression and execute specialized code to quickly find matches of those
prefixes in the search text. The Pike VM can then be avoided for most of the
search, and instead only executed when a prefix is found. The code to find
prefixes is in the regex-syntax crate (in this repository). The code to
search for literals is in src/literals.rs. When more than one literal prefix
is found, we fall back to an Aho-Corasick DFA using the aho-corasick crate.
For one literal, we use a variant of the Boyer-Moore algorithm. Both
Aho-Corasick and Boyer-Moore use `memchr` when appropriate. The Boyer-Moore
variant in this library also uses elementary frequency analysis to choose the
right byte to run `memchr` with.

Of course, detecting prefix literals can only take us so far. Not all regular
expressions have literal prefixes. To remedy this, we try another approach
to executing the Pike VM: backtracking, whose implementation can be found in
src/backtrack.rs. One reason why backtracking can be faster is that it avoids
excessive shuffling of capture groups. Of course, backtracking is susceptible
to exponential runtimes, so we keep track of every state we've visited to
make sure we never visit it again. This guarantees linear time execution, but
we pay for it with the memory required to track visited states. Because of
this memory requirement, we only use this engine on small search strings
*and* small regular expressions.
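To make the visited-state trick concrete, here is a minimal sketch of the
idea, not the crate's actual code (the `Visited` type and its methods are
invented for exposition): one bit per (instruction, input position) pair, set
the first time that pair is reached.

```
/// A bitmap over all (instruction, input position) pairs. Its size is
/// `num_insts * (text_len + 1)` bits, which is why the bounded backtracker
/// is only used for small regexes on small inputs.
struct Visited {
    bits: Vec<u32>,
    text_len: usize,
}

impl Visited {
    fn new(num_insts: usize, text_len: usize) -> Visited {
        let slots = num_insts * (text_len + 1);
        Visited { bits: vec![0; slots / 32 + 1], text_len: text_len }
    }

    /// Returns true if (ip, at) was already visited; otherwise marks it
    /// visited and returns false. The backtracker abandons any branch for
    /// which this returns true, guaranteeing linear time.
    fn check_and_set(&mut self, ip: usize, at: usize) -> bool {
        let k = ip * (self.text_len + 1) + at;
        let (word, bit) = (k / 32, k % 32);
        if self.bits[word] & (1 << bit) != 0 {
            return true;
        }
        self.bits[word] |= 1 << bit;
        false
    }
}
```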
Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.
It is distinct from the Pike VM in that the DFA is explicitly represented in
memory and is only ever in one state at a time. It is said to be "lazy"
because the DFA is computed as text is searched, where each byte in the
search text results in at most one new DFA state. It is made fast by caching
states. DFAs are susceptible to exponential state blow up (where the worst
case is computing a new state for every input byte, regardless of what's in
the state cache). To avoid using a lot of memory, the lazy DFA uses a bounded
cache. Once the cache is full, it is wiped and state computation starts over
again. If the cache is wiped too frequently, then the DFA gives up and
searching falls back to one of the aforementioned algorithms.

All of the above matching engines expose precisely the same matching
semantics. This is indeed tested. (See the section below about testing.)

The following sub-sections describe the rest of the library and how each of
the matching engines are actually used.

### Parsing

Regular expressions are parsed using the regex-syntax crate, which is
maintained in this repository. The regex-syntax crate defines an abstract
syntax and provides very detailed error messages when a parse error is
encountered. Parsing is done in a separate crate so that others may benefit
from its existence, and because it is relatively divorced from the rest of
the regex library.

The regex-syntax crate also provides sophisticated support for extracting
prefix and suffix literals from regular expressions.

### Compilation

The compiler is in src/compile.rs. The input to the compiler is some abstract
syntax for a regular expression and the output is a sequence of opcodes that
matching engines use to execute a search. (One can think of matching engines
as mini virtual machines.) The sequence of opcodes is a particular encoding
of a non-deterministic finite automaton. In particular, the opcodes
explicitly rely on epsilon transitions.

Consider a simple regular expression like `a|b`. Its compiled form looks like
this:

    000 Save(0)
    001 Split(2, 3)
    002 'a' (goto: 4)
    003 'b'
    004 Save(1)
    005 Match

The first column is the instruction pointer and the second column is the
instruction. Save instructions indicate that the current position in the
input should be stored in a captured location. Split instructions represent a
binary branch in the program (i.e., epsilon transitions). The instructions
`'a'` and `'b'` indicate that the literal bytes `'a'` or `'b'` should match.

In older versions of this library, the compiled form of the same expression
looked like this:

    000 Save(0)
    001 Split(2, 3)
    002 'a'
    003 Jump(5)
    004 'b'
    005 Save(1)
    006 Match

In particular, empty instructions that merely served to move execution from
one point in the program to another were removed. Instead, every instruction
has a `goto` pointer embedded into it. This resulted in a small performance
boost for the Pike VM, because it was one fewer epsilon transition that it
had to follow.

There exist more instructions, and they are defined and documented in
src/prog.rs.
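For a flavor of what these opcodes look like as Rust data, here is an
illustrative sketch; the real definitions live in src/prog.rs and differ in
detail (this enum is simplified for exposition):

```
/// A simplified sketch of the opcodes shown above. Note that every
/// instruction that consumes input carries its own `goto`, which is what
/// made the dedicated Jump instruction unnecessary.
enum Inst {
    /// Record the current input position in capture slot `slot`, then
    /// continue at `goto`.
    Save { slot: usize, goto: usize },
    /// An epsilon transition: execution continues at both `goto1` and
    /// `goto2`.
    Split { goto1: usize, goto2: usize },
    /// Match a single character, then continue at `goto`.
    Char { c: char, goto: usize },
    /// The match state.
    Match,
}
```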
Compilation has several knobs and a few unfortunately complicated invariants.
Namely, the output of compilation can be one of two types of programs: a
program that executes on Unicode scalar values or a program that executes on
raw bytes. In the former case, the matching engine is responsible for
performing UTF-8 decoding and executing instructions using Unicode
codepoints. In the latter case, the program handles UTF-8 decoding
implicitly, so that the matching engine can execute on raw bytes. All
matching engines can execute either Unicode or byte based programs except for
the lazy DFA, which requires byte based programs. In general, both
representations were kept because (1) the lazy DFA requires byte based
programs so that states can be encoded in a memory efficient manner and (2)
the Pike VM benefits greatly from inlining Unicode character classes into
fewer instructions, as it results in fewer epsilon transitions.

N.B. UTF-8 decoding is built into the compiled program by making use of the
utf8-ranges crate. The compiler in this library factors out common suffixes
to reduce the size of huge character classes (e.g., `\pL`).

A regrettable consequence of this split in instruction sets is that we
generally need to compile two programs: one for NFA execution and one for the
lazy DFA.

In fact, it is worse than that: the lazy DFA is not capable of finding the
starting location of a match in a single scan, and must instead execute a
backwards search after finding the end location. To execute a backwards
search, we must have compiled the regular expression *in reverse*.

This means that every compilation of a regular expression generally results
in three distinct programs. It would be possible to lazily compile the
Unicode program, since it is never needed if (1) the regular expression uses
no word boundary assertions and (2) the caller never asks for sub-capture
locations.

### Execution

At the time of writing, there are four matching engines in this library:

1. The Pike VM (supports captures).
2. Bounded backtracking (supports captures).
3. Literal substring or multi-substring search.
4. Lazy DFA (no support for Unicode word boundary assertions).

Only the first two matching engines are capable of executing every regular
expression program. They also happen to be the slowest, which means we need
some logic that (1) knows various facts about the regular expression and (2)
knows what the caller wants. Using this information, we can determine which
engine (or engines) to use; a simplified sketch of this dispatch appears
after the Programs sub-section below.

The logic for choosing which engine to execute is in src/exec.rs and is
documented on the Exec type. Exec values contain regular expression Programs
(defined in src/prog.rs), which contain all the necessary tidbits for
actually executing a regular expression on search text.

For the most part, the execution logic is straightforward and follows the
limitations of each engine described above pretty faithfully. The hairiest
part of src/exec.rs by far is the execution of the lazy DFA, since it
requires a forwards and backwards search, and then falls back to either the
Pike VM or backtracking if the caller requested capture locations.

The Exec type also contains mutable scratch space for each type of matching
engine. This scratch space is used during search (for example, for the lazy
DFA, it contains compiled states that are reused on subsequent searches).

### Programs

A regular expression program is essentially a sequence of opcodes produced by
the compiler plus various facts about the regular expression (such as whether
it is anchored, its capture names, etc.).
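Putting the last two sub-sections together, the dispatch over a compiled
program can be pictured with the following deliberately simplified sketch. It
is purely illustrative: the names (`MatchEngine`, `choose`) and boolean
inputs are invented here, and the real logic on the Exec type weighs many
more factors.

```
enum MatchEngine {
    Literals,
    LazyDfa,
    Backtrack,
    PikeVm,
}

fn choose(is_pure_literal: bool, dfa_ok: bool, small: bool) -> MatchEngine {
    if is_pure_literal {
        // The whole regex is a literal alternation: skip regex engines.
        MatchEngine::Literals
    } else if dfa_ok {
        // No Unicode word boundaries: the lazy DFA can run on bytes.
        MatchEngine::LazyDfa
    } else if small {
        // Small regex on a small input: bounded backtracking is cheap.
        MatchEngine::Backtrack
    } else {
        // The Pike VM can handle everything, just more slowly.
        MatchEngine::PikeVm
    }
}
```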
### The regex! macro

The `regex!` macro no longer exists. It was developed in a bygone era as a
compiler plugin during the infancy of the regex crate. Back then, the only
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
was built at runtime were the following:

  1. Syntax checking was done at compile time. Your Rust program wouldn't
     compile if your regex didn't compile.
  2. Reduction of overhead that was proportional to the size of the regex.
     For the most part, this overhead consisted of heap allocation, which
     was nearly eliminated in the compiler plugin.

The main takeaway here is that the compiler plugin was a marginally faster
version of a slow regex engine. As the regex crate evolved, it grew other
regex engines (DFA, bounded backtracker) and sophisticated literal
optimizations. The `regex!` macro didn't keep pace, and it therefore became
(dramatically) slower than the dynamic engines. The only reason left to use
it was for the compile time guarantee that your regex is correct.
Fortunately, Clippy (the Rust lint tool) has a lint that checks the validity
of your regular expressions, which mostly replaces that use case.

Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
complained. At that point, it seemed prudent to just remove it.

Will a compiler plugin be brought back? The future is murky, but there is
definitely an opportunity there to build something that is faster than the
dynamic engines in some cases. But it will be challenging! As of now, there
are no plans to work on this.


## Testing

A key aspect of any mature regex library is its test suite. A subset of the
tests in this library come from Glenn Fowler's AT&T test suite (its online
presence seems gone at the time of writing). The source of the test suite is
located in src/testdata. The script scripts/regex-match-tests.py takes the
test suite in src/testdata and generates tests/matches.rs.

There are also many other manually crafted tests and regression tests in
tests/tests.rs. Some of these tests were taken from RE2.

The biggest source of complexity in the tests is related to answering this
question: how can we reuse the tests to check all of our matching engines?
One approach would have been to encode every test into some kind of format
(like the AT&T test suite) and code generate tests for each matching engine.
The approach we use in this library is to create a Cargo.toml entry point for
each matching engine we want to test (a sketch of the pattern these entry
points share follows the list below). The entry points are:

* `tests/test_default.rs` - tests `Regex::new`
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex.
* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *arbitrary* byte based programs.
* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *UTF-8* byte based programs.
* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
  backtracking on every regex.
* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *arbitrary* byte based programs.
* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *UTF-8* byte based programs.
* `tests/test_crates_regex.rs` - tests to make sure that all of the
  backends behave in the same way against a number of quickcheck
  generated random inputs. These tests need to be enabled through
  the `RUST_REGEX_RANDOM_TEST` environment variable (see below).
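Each of these entry points follows the same basic shape. The sketch below is
hypothetical (the macro name, included file name, and test are invented for
illustration and do not match the crate's actual test harness verbatim):
define how a regex is constructed, then include the shared tests so that the
same assertions run against every engine.

```
extern crate regex;

// Each entry point defines this macro differently, e.g. forcing the NFA or
// backtracking engine; the shared tests only ever construct regexes through
// it.
macro_rules! regex {
    ($re:expr) => {
        regex::Regex::new($re).unwrap()
    };
}

// Every entry point then pulls in the same shared test corpus, which
// exercises whichever engine this entry point configures. (File name
// invented for this sketch.)
include!("shared_tests.rs");
```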
The lazy DFA and pure literal engines are absent from this list because
they cannot be used on every regular expression. Instead, we rely on
`tests/test_dynamic.rs` to test the lazy DFA and literal engines when
possible.

Since the tests are repeated several times, and because `cargo test` runs all
entry points, it can take a while to compile everything. To reduce compile
times slightly, try using `cargo test --test default`, which will only use
the `tests/test_default.rs` entry point.

The random testing takes quite a while, so it is not enabled by default.
In order to run the random testing you can set the
`RUST_REGEX_RANDOM_TEST` environment variable to anything before
invoking `cargo test`. Note that this variable is inspected at compile
time, so if the tests don't seem to be running, you may need to run
`cargo clean`.

## Benchmarking

The benchmarking in this crate is made up of many micro-benchmarks.
Currently, there are two primary sets of benchmarks: the benchmarks that were
adopted at this library's inception (in `bench/src/misc.rs`) and a newer set
of benchmarks meant to test various optimizations. Specifically, the latter
set contains some analysis and is in `bench/src/sherlock.rs`. Also, the
latter set is executed entirely on the same lengthy input, whereas the former
benchmarks are executed on strings of varying length.

There is also a smattering of benchmarks for parsing and compilation.

Benchmarks are in a separate crate so that their dependencies can be managed
separately from the main regex crate.

Benchmarking follows a similarly wonky setup as tests. There are multiple
entry points:

* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` - benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
* `bench_onig.rs` - benchmarks Oniguruma

The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
regular expression library. In general, this regex library compares favorably
(there are even a few benchmarks that PCRE simply runs too slowly on or
outright can't execute at all). I would love to add other regular expression
library benchmarks (especially RE2).

If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

    $ (cd bench && ./run rust)

If you want to compare your results with older benchmarks, then try:

    $ (cd bench && ./run rust | tee old)
    $ ... make it faster
    $ (cd bench && ./run rust | tee new)
    $ cargo benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
`./bench/bench --help`.
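For orientation, here is a minimal sketch of the micro-benchmark style these
entry points contain, assuming the nightly-only `test` crate; the regex and
haystack are invented for illustration.

```
#![feature(test)]

extern crate regex;
extern crate test;

use test::Bencher;

#[bench]
fn literal_alternation(b: &mut Bencher) {
    // An invented regex and haystack, just to show the shape of a benchmark.
    let re = regex::Regex::new(r"Sherlock|Holmes").unwrap();
    let haystack = "Holmes ".repeat(1_000);
    // The harness reports the mean time per iteration of this closure.
    b.iter(|| assert!(re.is_match(&haystack)));
}
```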
## Dev Docs

When digging your teeth into the codebase for the first time, the
crate documentation can be a great resource. By default `rustdoc`
will strip out all documentation of private crate members in an
effort to help consumers of the crate focus on the *interface*
without having to concern themselves with the *implementation*.
Normally this is a great thing, but if you want to start hacking
on regex internals it is not what you want. Many of the private members
of this crate are well documented with rustdoc style comments, and
it would be a shame to miss out on the opportunity that presents.
You can generate the private docs with:

```
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
```

Then just point your browser at `target/doc/regex/index.html`.

See https://github.com/rust-lang/rust/issues/15347 for more info
about generating developer docs for internal use.