• Home
  • Raw
  • Download

Lines Matching +full:to +full:- +full:regex

1 Your friendly guide to hacking and navigating the regex library.
14 finite automata. In particular, a design goal is to make searching linear
15 with respect to both the regular expression and the text being searched.
17 implementation of the Pike VM (similar to Thompson's construction, but supports
19 --- This library contains such an implementation in src/pikevm.rs.
24 same epsilon transitions over and over again. We can employ one trick to
26 expression and execute specialized code to quickly find matches of those
28 search, and instead only executed when a prefix is found. The code to find
29 prefixes is in the regex-syntax crate (in this repository). The code to search
31 we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one
32 literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and
33 Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this
34 library also uses elementary frequency analysis to choose the right byte to run
38 expressions have literal prefixes. To remedy this, we try another approach
39 to executing the Pike VM: backtracking, whose implementation can be found in
42 to exponential runtimes, so we keep track of every state we've visited to make
44 pay for it with the memory required to track visited states. Because of the
50 memory and is only ever in one state at a time. It is said to be "lazy" because
53 are susceptible to exponential state blow up (where the worst case is computing
54 a new state for every input byte, regardless of what's in the state cache). To
57 wiped too frequently, then the DFA gives up and searching falls back to one of
63 The following sub-sections describe the rest of the library and how each of the
68 Regular expressions are parsed using the regex-syntax crate, which is
69 maintained in this repository. The regex-syntax crate defines an abstract
73 regex library.
75 The regex-syntax crate also provides sophisticated support for extracting
80 The compiler is in src/compile.rs. The input to the compiler is some abstract
82 matching engines use to execute a search. (One can think of matching engines as
84 non-deterministic finite automaton. In particular, the opcodes explicitly rely
113 In particular, empty instructions that merely served to move execution from one
114 point in the program to another were removed. Instead, every instruction has a
116 the Pike VM, because it was one fewer epsilon transition that it had to follow.
125 performing UTF-8 decoding and executing instructions using Unicode codepoints.
126 In the latter case, the program handles UTF-8 decoding implicitly, so that the
135 N.B. UTF-8 decoding is built into the compiled program by making use of the
136 utf8-ranges crate. The compiler in this library factors out common suffixes to
140 need to compile two programs; one for NFA execution and one for the lazy DFA.
144 backwards search after finding the end location. To execute a backwards search,
148 three distinct programs. It would be possible to lazily compile the Unicode
150 boundary assertions and (2) the caller never asks for sub-capture locations.
158 3. Literal substring or multi-substring search.
162 expression program. They also happen to be the slowest, which means we need
165 engine (or engines) to use.
167 The logic for choosing which engine to execute is in src/exec.rs and is
172 For the most part, the execution logic is straight-forward and follows the
175 a forwards and backwards search, and then falls back to either the Pike VM or
188 ### The regex! macro
190 The `regex!` macro no longer exists. It was developed in a bygone era as a
191 compiler plugin during the infancy of the regex crate. Back then, then only
192 matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
197 compile if your regex didn't compile.
198 2. Reduction of overhead that was proportional to the size of the regex.
203 version of a slow regex engine. As the regex crate evolved, it grew other regex
205 The regex macro didn't keep pace, and it therefore became (dramatically) slower
206 than the dynamic engines. The only reason left to use it was for the compile
207 time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint
211 Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
212 complained. At that point, it seemed prudent to just remove it.
215 definitely an opportunity there to build something that is faster than the
217 are no plans to work on this.
222 A key aspect of any mature regex library is its test suite. A subset of the
225 located in src/testdata. The scripts/regex-match-tests.py takes the test suite
231 The biggest source of complexity in the tests is related to answering this
232 question: how can we reuse the tests to check all of our matching engines? One
233 approach would have been to encode every test into some kind of format (like
235 approach we use in this library is to create a Cargo.toml entry point for each
236 matching engine we want to test. The entry points are:
238 * `tests/test_default.rs` - tests `Regex::new`
239 * `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
240 * `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
241 algorithm on every regex.
242 * `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
243 algorithm on every regex and use *arbitrary* byte based programs.
244 * `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
245 algorithm on every regex and use *UTF-8* byte based programs.
246 * `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
247 backtracking on every regex.
248 * `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
249 backtracking on every regex and use *arbitrary* byte based programs.
250 * `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
251 backtracking on every regex and use *UTF-8* byte based programs.
252 * `tests/test_crates_regex.rs` - tests to make sure that all of the
254 generated random inputs. These tests need to be enabled through
260 `tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible.
263 entry points, it can take a while to compile everything. To reduce compile
264 times slightly, try using `cargo test --test default`, which will only use the
268 In order to run the random testing you can set the
269 `RUST_REGEX_RANDOM_TEST` environment variable to anything before
271 time, so if the tests don't seem to be running, you may need to run
276 The benchmarking in this crate is made up of many micro-benchmarks. Currently,
279 benchmarks meant to test various optimizations. Specifically, the latter set
287 separately from the main regex crate.
292 * `bench_rust.rs` - benchmarks `Regex::new`
293 * `bench_rust_bytes.rs` benchmarks `bytes::Regex::new`
294 * `bench_pcre.rs` - benchmarks PCRE
295 * `bench_onig.rs` - benchmarks Oniguruma
297 The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
298 regular expression library. In general, this regex library compares favorably
300 outright can't execute at all). I would love to add other regular expression
303 If you're hacking on one of the matching engines and just want to see
304 benchmarks, then all you need to run is:
308 If you want to compare your results with older benchmarks, then try:
313 $ cargo benchcmp old new --improvements
315 The `cargo-benchcmp` utility is available here:
316 https://github.com/BurntSushi/cargo-benchcmp
319 `./bench/bench --help`.
326 effort to help consumers of the crate focus on the *interface*
327 without having to concern themselves with the *implementation*.
328 Normally this is a great thing, but if you want to start hacking
329 on regex internals it is not what you want. Many of the private members
331 it would be a shame to miss out on the opportunity that presents.
335 $ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes co…
338 Then just point your browser at `target/doc/regex/index.html`.
340 See https://github.com/rust-lang/rust/issues/15347 for more info