Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes some familiarity with the public API of this crate, which
can be found here: https://docs.rs/regex

## Theory vs. Practice

One of the design goals of this crate is to provide worst case linear time
behavior with respect to the text searched using finite state automata. This
means that, *in theory*, the performance of this crate is much better than most
regex implementations, which typically use backtracking and therefore have
worst case exponential time.

For example, try opening a Python interpreter and typing this:

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30).span()

I'll wait.

At some point, you'll figure out that it won't terminate any time soon. ^C it.

The promise of this crate is that *this pathological behavior can't happen*.
25
With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
improvement. In particular, compilation can take anywhere from a few dozen
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
classes, in particular, have the largest impact on compilation performance. At
the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
is because `\pL` corresponds to every letter in Unicode and compilation must
turn it into a proper automaton that decodes a subset of UTF-8 which
corresponds to those letters. Compilation also spends some cycles shrinking the
size of the automaton.

This means that in order to realize efficient regex matching, one must
*amortize the cost of compilation*. Trivially, if a call to `is_match` is
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.
51
In many programming languages, regular expressions can be conveniently defined
and compiled in a global scope, and code can reach out and use them as if
they were global static variables. In Rust, there is really no concept of
life-before-main, and therefore, one cannot utter this:

    static MY_REGEX: Regex = Regex::new("...").unwrap();

Unfortunately, this would seem to imply that one must pass `Regex` objects
around to everywhere they are used, which can be especially painful depending
on how your program is structured. Thankfully, the
[`lazy_static`](https://crates.io/crates/lazy_static)
crate provides an answer that works well:

    use lazy_static::lazy_static;
    use regex::Regex;

    fn some_helper_function(text: &str) -> bool {
        lazy_static! {
            static ref MY_REGEX: Regex = Regex::new("...").unwrap();
        }
        MY_REGEX.is_match(text)
    }

In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
it were a global static value. What is actually happening under the covers is
that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
it's self contained and everything works exactly as you expect. In particular,
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
a `Mutex`. On that note...

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not use any more memory than sharing a
single `Regex` across multiple threads simultaneously. *Its only cost is
ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.
94
One might imagine that this is possible because a `Regex` represents a
*compiled* program, so that any allocation or mutation is already done, and is
therefore read-only. Unfortunately, this is not true. Each type of search
strategy in this crate requires some kind of mutable scratch space to use
*during search*. For example, when executing a DFA, its states are computed
lazily and reused on subsequent searches. Those states go into that mutable
scratch space.

The mutable scratch space is an implementation detail, and in general, its
mutation should not be observable from users of this crate. Therefore, it uses
interior mutability. This implies that `Regex` can either only be used from one
thread, or it must do some sort of synchronization. Either choice is
reasonable, but this crate chooses the latter, in particular because it is
ergonomic and makes use with `lazy_static!` straightforward.

Synchronization implies *some* amount of overhead. When a `Regex` is used from
a single thread, this overhead is negligible. When a `Regex` is used from
multiple threads simultaneously, it is possible for the overhead of
synchronization from contention to impact performance. The specific cases where
contention may happen are when you are calling any of these methods repeatedly
from multiple threads simultaneously:

* shortest_match
* is_match
* find
* captures

In particular, every invocation of one of these methods must synchronize with
other threads to retrieve its mutable scratch space before searching can start.
If, however, you are using one of these methods:

* find_iter
* captures_iter

Then you may not suffer from contention since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch space
is obtained when the iterator is created and retained throughout its lifetime.
132
## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
* find
* captures

In general, these are ordered from fastest to slowest.

`is_match` is fastest because it doesn't actually need to find the start or the
end of the leftmost-first match. It can quit immediately after it knows there
is a match. For example, given the regex `a+` and the haystack, `aaaaa`, the
search will quit after examining the first byte.

In contrast, `find` must return both the start and end location of the
leftmost-first match. It can use the DFA matcher for this, but must run it
forwards once to find the end of the match *and then run it backwards* to find
the start of the match. The two scans and the cost of finding the real end of
the leftmost-first match make this more expensive than `is_match`.

`captures` is the most expensive of them all because it must do what `find`
does, and then run either the bounded backtracker or the Pike VM to fill in the
capture group locations. Both of these are simulations of an NFA, which must
spend a lot of time shuffling states around. The DFA limits the performance hit
somewhat by restricting the amount of text that must be searched via an NFA
simulation.

One other method not mentioned is `shortest_match`. This method has precisely
the same performance characteristics as `is_match`, except it will return the
end location of where it discovered a match. For example, given the regex `a+`
and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
the latter being the correct end location of the leftmost-first match.
168
## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
there's a match is the regex engine invoked at the location of the match, which
effectively permits the regex engine to skip large portions of a haystack.
If a regex consists entirely of literals (possibly more than one), then
it's possible that the regex engine can be avoided entirely even when there's a
match.

When one literal is found, Boyer-Moore is used. When multiple literals are
found, then an optimized version of Aho-Corasick is used.

This crate extends this optimization quite a bit. Here are
a few examples of regexes that get literal prefixes detected:

* `(foo|bar)` detects `foo` and `bar`
* `(a|b)c` detects `ac` and `bc`
* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
* `a?b` detects `a` and `b`
* `a*b` detects `a` and `b`
* `(ab){3,6}` detects `ababab`

Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

It's a sad state of the current implementation. At the moment, the DFA will try
to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
alternative matching engine that can handle Unicode word boundaries correctly.
The alternative matching engine is generally quite a bit slower (perhaps by an
order of magnitude). If necessary, this can be ameliorated in two ways.

The first way is to add some number of literal prefixes to your regular
expression. Even though the DFA may not be used, specialized routines will
still kick in to find prefix literals quickly, which limits how much work the
NFA simulation will need to do.

The second way is to give up on Unicode and use an ASCII word boundary instead.
One can use an ASCII word boundary by disabling Unicode support. That is,
instead of using `\b`, use `(?-u:\b)`. Namely, given the regex `\b.+\b`, it
can be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
is important to limit the scope of disabling the `u` flag, since it might lead
to a syntax error if the regex could match arbitrary bytes. For example, if one
wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
any *byte* when the Unicode flag is disabled.
226
The second way isn't appreciably different than just using a Unicode word
boundary in the first place, since the DFA will speculatively interpret it as
an ASCII word boundary anyway. The key difference is that if an ASCII word
boundary is used explicitly, then the DFA won't quit in the presence of
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
more consistent performance.

N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.
236
## Excessive counting can lead to exponential state blow up in the DFA

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.

Wait, didn't I say that this crate guards against exponential worst cases?
Well, it turns out that the process of converting an NFA to a DFA can lead to
an exponential blow up in the number of states. This crate specifically guards
against exponential blow up by doing two things:

1. The DFA is computed lazily. That is, a state in the DFA only exists in
   memory if it is visited. In particular, the lazy DFA guarantees that *at
   most* one state is created for every byte of input. This, on its own,
   guarantees linear time complexity.
2. Of course, creating a new state for *every* byte of input means that search
   will go incredibly slow because of very large constant factors. On top of
   that, creating a state for every byte in a large haystack could result in
   exorbitant memory usage. To ameliorate this, the DFA bounds the number of
   states it can store. Once it reaches its limit, it flushes its cache. This
   prevents reuse of states that it already computed. If the cache is flushed
   too frequently, then the DFA will give up and execution will fall back to
   one of the NFA simulations.

In effect, this crate will detect exponential state blow up and fall back to
a search routine with fixed memory requirements. This does, however, mean that
searching will be much slower than one might expect. Regexes that rely on
counting in particular are strong aggravators of this behavior. A good example
is matching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.

In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable for this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.
273
274An entire book was written on how to optimize Perl-style regular expressions.
275Most of those techniques are not applicable for this library. For example,
276there is no problem with using non-greedy matching or having lots of
277alternations in your regex.
278