
Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes familiarity with the public API of this crate, which can be
found here: https://docs.rs/regex

One of the design goals of this crate is to provide worst case linear time
behavior with respect to the text searched, using finite state automata. This
means that, in theory, this crate's worst case is much better than that of most
regex implementations, which typically use backtracking and therefore have
worst case exponential time behavior.

There are, however, a few cases where the current regex engine isn't quite
optimal. This guide details those cases and provides guidance on how to avoid
them, among other bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Compiling a regular expression in this crate can take anywhere from a few
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
classes, in particular, can increase compilation times dramatically. At
the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
is because `\pL` corresponds to every letter in Unicode and compilation must
turn it into a proper automaton that decodes a subset of UTF-8 which
corresponds to those letters. Compilation also spends some cycles shrinking the
size of the automaton.

This means that in order to realize efficient regex matching, one must
*amortize the cost of compilation*. Trivially, if a call to `is_match` is
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.

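As a minimal sketch of that advice (the pattern, function name, and input are
invented for illustration), compile once outside the loop and reuse the
compiled `Regex` inside it:

```rust
use regex::Regex;

fn count_dates(lines: &[&str]) -> usize {
    // Compile once, *outside* the loop over the input.
    let re = Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap();
    // Reuse the compiled `Regex` for every line.
    lines.iter().filter(|line| re.is_match(line)).count()
}
```
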
In Rust, there is no concept of life-before-main, and therefore, one cannot
utter this:

```rust
static MY_REGEX: Regex = Regex::new("...").unwrap();
```

Unfortunately, this would seem to imply that one must pass `Regex` objects
around to everywhere they are used, which can be especially painful depending
on how your program is structured. Thankfully, the `lazy_static` crate provides
a way out:

```rust
use lazy_static::lazy_static;
use regex::Regex;

fn some_helper_function(text: &str) -> bool {
    lazy_static! {
        static ref MY_REGEX: Regex = Regex::new("...").unwrap();
    }
    MY_REGEX.is_match(text)
}
```

In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
it were a global static value. In actuality, the `Regex` is lazily initialized:
the code inside the macro (i.e., `Regex::new(...)`) is run on *first use* of
`MY_REGEX` and its value is reused on all subsequent uses.

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any more memory overhead than
using a `Regex` from multiple threads simultaneously. *Its only cost is
ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.

One might imagine that this is possible because a `Regex` represents a
*compiled* program, which is never modified after construction and is
therefore read-only. Unfortunately, this is not true. Each type of search
strategy in this crate requires some kind of mutable scratch space to use
*during search*, and that scratch space is hidden from the caller via
interior mutability. This implies that `Regex` can either only be used from one
thread at a time, or it must do some kind of synchronization internally.

Synchronization implies *some* amount of overhead. When a `Regex` is used from
a single thread, this overhead is negligible. When a `Regex` is used from
multiple threads simultaneously, it is possible for the overhead of
synchronization from contention to impact performance. The specific cases where
this is most likely are workloads in which many threads repeatedly search with
the same `Regex` at the same time, since a search may need to wait on other
threads to retrieve its mutable scratch space before searching can start. If
that contention does show up in a profile, cloning the `Regex` so that each
thread gets its own copy avoids it entirely.

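As a rough sketch of the cloning workaround (the pattern and the chunked
workload are invented for illustration):

```rust
use std::thread;

use regex::Regex;

fn count_errors(chunks: Vec<String>) -> usize {
    // Compile once; each thread then gets its own clone, so no thread ever
    // contends with another for the regex's mutable scratch space.
    let re = Regex::new(r"\berror\b").unwrap();
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let re = re.clone();
            thread::spawn(move || re.find_iter(&chunk).count())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```
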
## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* `is_match`
* `find`
* `captures`

In general, these are ordered from fastest to slowest.

`is_match` is fastest because it doesn't actually need to find the start or the
end of the leftmost-first match. It can quit immediately after it knows there
is a match. For example, given the regex `a+` and the haystack `aaaaa`, the
search will quit after examining the first byte.

In contrast, `find` must return both the start and end location of the
leftmost-first match. It can use the DFA matcher for this, but must run it
forwards once to find the end of the match *and then run it backwards* to find
the start of the match. The two scans and the cost of finding the real end of
the leftmost-first match make this more expensive than `is_match`.

`captures` is the most expensive of them all, because it must do everything
`find` does, and then run either the bounded backtracker or the Pike VM to fill
in the capture group locations. Both of these engines are considerably slower
than the DFA.

There is also a `shortest_match` method, which returns only the
end location of when it first discovered a match. For example, given the regex
`a+` and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to
`5`, the latter of which is the correct end location of the leftmost-first
match.

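To make the ordering concrete, here is a small sketch (pattern and haystack
invented) that asks for progressively more information:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)\.com").unwrap();
    let hay = "contact: alice@example.com";

    // Cheapest: a yes/no answer, no match offsets required.
    assert!(re.is_match(hay));

    // More work: the start and end of the leftmost-first match.
    let m = re.find(hay).unwrap();
    assert_eq!(m.as_str(), "alice@example.com");

    // Most work: capture group locations are filled in as well.
    let caps = re.captures(hay).unwrap();
    assert_eq!(&caps[1], "alice");
    assert_eq!(&caps[2], "example");
}
```
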
## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as a prefix.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched for before entering the (much slower) regex engine. For
example, given the regex `foo\w+`, the literal `foo` will be searched for using
Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
there's a match is the regex engine invoked at the location of the match, which
effectively permits the regex engine to skip large portions of a haystack.
If a regex is comprised entirely of literals (possibly more than one), then
it's possible that the regex engine can be avoided entirely even when there's a
match.

When one literal is found, Boyer-Moore is used. When multiple literals are
found, then an optimized version of Aho-Corasick is used.

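As an illustrative sketch (patterns invented; exactly which internal searcher
runs is an implementation detail), both of these shapes benefit from literal
optimizations:

```rust
use regex::Regex;

fn main() {
    // Prefix literal: `foo` can be located with a fast substring search
    // before the regex engine is consulted for the `\w+` tail.
    let prefixed = Regex::new(r"foo\w+").unwrap();
    assert!(prefixed.is_match("xyz foobar"));

    // Entirely literals: an alternation like this may be answered by a
    // multi-pattern literal searcher without entering the regex engine.
    let literals = Regex::new(r"foo|bar|quux").unwrap();
    assert!(literals.is_match("say quux"));
}
```
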
Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead if you care about consistent performance more than correctness.

Unfortunately, the DFA in this crate does not support Unicode word boundaries.
Instead, when it sees `\b`, it will try to interpret Unicode word boundaries as
if they were ASCII word boundaries. If the DFA comes across any non-ASCII byte,
it will quit and fall back to an alternative matching engine that supports
Unicode word boundaries correctly, which can be quite a bit slower. There are
two ways to work around this.

The first way is to add some number of literal prefixes to your regular
expression. Even if the DFA can't be used, the literal optimizations described
above will still kick in to find prefix literals quickly, which limits how much
work the NFA simulation will need to do.

The second way is to give up on Unicode and use an ASCII word boundary instead.
That is, instead of using `\b`, use `(?-u:\b)`. Namely, given the regex
`\b.+\b`, it can be transformed into a regex that uses the DFA with
`(?-u:\b).+(?-u:\b)`. It is important to limit the scope of disabling the `u`
flag, since it might lead to a syntax error if the regex could match arbitrary
bytes. For example, if one wrote `(?-u)\b.+\b`, then a syntax error would be
returned because `.` would then be allowed to match arbitrary bytes, which a
`Regex` that searches `&str` must reject in order to guarantee that matches are
valid UTF-8.

Keep in mind that an ASCII word boundary only recognizes ASCII word characters,
so it can report different (incorrect) boundaries on text containing
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
more consistent performance.

N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.

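Here is a small sketch of the scoped `(?-u:\b)` transformation described above
(the haystack is invented, and the results are printed rather than asserted
because the two patterns can legitimately disagree on non-ASCII text):

```rust
use regex::Regex;

fn main() {
    // Unicode word boundaries: correct on non-ASCII text, but may prevent the
    // DFA from being used.
    let unicode = Regex::new(r"\b\w+\b").unwrap();

    // ASCII word boundaries, scoped with `(?-u:...)` so the rest of the
    // pattern keeps its Unicode semantics.
    let ascii = Regex::new(r"(?-u:\b)\w+(?-u:\b)").unwrap();

    let hay = "déjà vu";
    // The correctness trade-off described above: these may report different
    // matches on text containing non-ASCII bytes.
    println!("{:?}", unicode.find(hay).map(|m| m.as_str()));
    println!("{:?}", ascii.find(hay).map(|m| m.as_str()));
}
```
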
## Excessive counting can lead to exponential state blow up in the DFA

It turns out that the process of converting an NFA to a DFA can lead to an
exponential blow up in the number of states, and therefore to exorbitant
memory usage. To ameliorate this, the DFA bounds the number of states it will
store; when that bound is reached, it clears its state cache. If the cache is
cleared too frequently, then the DFA will give up and execution will fall back
to one of the NFA simulations.

In effect, this crate will detect exponential state blow up and fall back to a
slower, but correct, matching engine instead of consuming an exorbitant amount
of memory.

In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to use.

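For what it's worth, newer versions of the crate do expose a related knob:
`RegexBuilder::dfa_size_limit` bounds the amount of memory the lazy DFA's cache
may use. A hedged sketch (the pattern and the limit value are arbitrary):

```rust
use regex::RegexBuilder;

fn main() {
    // Allow the lazy DFA's cache to use up to ~20 MB before it starts
    // clearing states (value chosen arbitrarily for illustration).
    let re = RegexBuilder::new(r"\w{50}")
        .dfa_size_limit(20 * (1 << 20))
        .build()
        .unwrap();
    assert!(re.is_match(&"a".repeat(100)));
}
```
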
## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable to this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.