lib.rs - OpenGrok cross reference for /external/rust/crates/regex/src/lib.rs

Lines Matching +full:to +full:- +full:regex
3 expressions. Its syntax is similar to Perl-style regular expressions, but lacks
5 execute in linear time with respect to the size of the regular expression and
13 documentation for the [`Regex`](struct.Regex.html) type.
17 This crate is [on crates.io](https://crates.io/crates/regex) and can be
18 used by adding `regex` to your dependencies in your project's `Cargo.toml`.
22 regex = "1"
28 expression and then using it to search, split or replace text. For example,
29 to confirm that some text resembles a date:
32 use regex::Regex;
33 let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
34 assert!(re.is_match("2014-01-01"));
39 it to match anywhere in the text. Anchors can be used to ensure that the
43 [raw strings](https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals)
49 # Example: Avoid compiling the same regex in a loop
51 It is an anti-pattern to compile the same regular expression in a loop
53 microseconds to a few **milliseconds** depending on the size of the
54 regex.) Not only is compilation itself expensive, but this also prevents
55 optimizations that reuse allocations internally to the matching engines.
57 In Rust, it can sometimes be a pain to pass regular expressions around if
59 [`lazy_static`](https://crates.io/crates/lazy_static) crate to ensure that
66 use regex::Regex;
68 fn some_helper_function(text: &str) -> bool {
70         static ref RE: Regex = Regex::new("...").unwrap();
78 Specifically, in this example, the regex will be compiled when it is used for
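
The compile-once idea can also be sketched with only the standard library. This is a minimal sketch, not the crate's documented approach: `std::sync::OnceLock` stands in for `lazy_static`, and `CompiledPattern`, `compile` and `COMPILE_COUNT` are hypothetical stand-ins for `regex::Regex`, `Regex::new` and a way to observe how often compilation happens.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how many times the (hypothetical) expensive compile step runs.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a compiled regex. With the real crate this would
// be a `regex::Regex` built by `Regex::new(...)`.
struct CompiledPattern {
    needle: &'static str,
}

impl CompiledPattern {
    // Stand-in for compilation: records that it ran, then builds the value.
    fn compile(needle: &'static str) -> CompiledPattern {
        COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
        CompiledPattern { needle }
    }

    // Stand-in for matching: a plain substring test.
    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle)
    }
}

fn some_helper_function(text: &str) -> bool {
    // Initialized on the first call, then reused by every later call.
    static RE: OnceLock<CompiledPattern> = OnceLock::new();
    RE.get_or_init(|| CompiledPattern::compile("..."))
        .is_match(text)
}
```

However many times `some_helper_function` is called, `COMPILE_COUNT` stays at 1, which is the property the `lazy_static` example relies on.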
84 repeatedly against a search string to find successive non-overlapping
85 matches. For example, to find all dates in a string and be able to access
89 # use regex::Regex;
91 let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
92 let text = "2012-03-14, 2013-01-01 and 2014-07-05";
108 Building on the previous example, perhaps we'd like to rearrange the date
109 formats. This can be done with text replacement. But to make the code
114 # use regex::Regex;
116 let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();
117 let before = "2012-03-14, 2013-01-01 and 2014-07-05";
125 `Regex::replace` for more details.)
127 Note that if your regex gets complicated, you can use the `x` flag to
131 # use regex::Regex;
133 let re = Regex::new(r"(?x)
135   -
137   -
140 let before = "2012-03-14, 2013-01-01 and 2014-07-05";
146 If you wish to match against whitespace in this mode, you can still use `\s`,
149 the `x` flag, e.g., `(?-x: )`.
153 This demonstrates how to use a `RegexSet` to match multiple (possibly
157 use regex::RegexSet;
173 // You can also test whether a particular regex matched:
181 With respect to searching text with a regular expression, there are three
188 Generally speaking, this crate could provide a function to answer only #3,
190 more expensive to compute the location of capturing group matches, so it's best
191 not to do it if you don't need to.
194 only need to test if an expression matches a string. (Use `is_match`
199 This implementation executes regular expressions **only** on valid UTF-8
200 while exposing match locations as byte indices into the search string. (To
201 relax this restriction, use the [`bytes`](bytes/index.html) sub-module.)
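
To illustrate what "byte indices" means for non-ASCII text, here is a small std-only sketch. It uses `str::find`, which also reports byte offsets, as a stand-in for a regex match location; `byte_and_char_offset` is a hypothetical helper, not part of this crate.

```rust
/// Returns the byte offset and the character offset of the first occurrence
/// of `needle` in `haystack`, or `None` if it does not occur.
fn byte_and_char_offset(haystack: &str, needle: char) -> Option<(usize, usize)> {
    // `str::find` reports a byte offset into the UTF-8 encoded string.
    let byte_idx = haystack.find(needle)?;
    // Counting the characters before that offset gives the character offset.
    let char_idx = haystack[..byte_idx].chars().count();
    Some((byte_idx, char_idx))
}
```

For `"ΔΔx"`, the `x` sits at byte offset 4 but character offset 2, because each `Δ` occupies two bytes in UTF-8. Byte offsets are the ones that can be used directly for slicing, e.g. `&haystack[4..]`.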
204 case-insensitively, the characters are first mapped using the "simple" case
212 # use regex::Regex;
214 let re = Regex::new(r"(?i)Δ+").unwrap();
223 * `.` will match any valid UTF-8 encoded Unicode scalar value except for `\n`.
224   (To also match `\n`, enable the `s` flag, e.g., `(?s:.)`.)
230 * `^` and `$` are **not** Unicode aware in multi-line mode. Namely, they only
239 # use regex::Regex;
241 let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
247 For a more detailed breakdown of Unicode support with respect to
250 [UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
251 document in the root of the regex repository.
255 The `bytes` sub-module provides a `Regex` type that can be used to match
256 on `&[u8]`. By default, text is interpreted as UTF-8 just like it is with
257 the main `Regex` type. However, this behavior can be disabled by turning
258 off the `u` flag, even if doing so could result in matching invalid UTF-8.
262 Disabling the `u` flag is also possible with the standard `&str`-based `Regex`
263 type, but it is only allowed where the UTF-8 invariant is maintained. For
264 example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an
265 `&str`-based `Regex`, but `(?-u:\xFF)` will attempt to match the raw byte
266 `\xFF`, which is invalid UTF-8 and therefore is illegal in `&str`-based
270 tables, this crate exposes knobs to disable the compilation of those
272 compilation times. For details on how to do that, see the section on [crate
273 features](#crate-features).
280 a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax).
288 \pN           One-letter name Unicode character class
290 \PN           Negated one-letter name Unicode character class
299 [a-z]         A character class matching any character in range a-z.
300 [[:alpha:]]   ASCII character class ([A-Za-z])
301 [[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
303 [a-y&&xyz]    Intersection (matching x or y)
304 [0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
305 [0-9--4]      Direct subtraction (matching 0-9 except 4)
306 [a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
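
The set operations above can be checked by hand with ordinary sets. The following std-only sketch models each class as a `BTreeSet<char>`; it does not call into this crate, and the helper names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Expands an inclusive character range such as `a-g` into a set.
fn range(lo: char, hi: char) -> BTreeSet<char> {
    (lo..=hi).collect()
}

/// Models `[a-y&&xyz]`: the characters in both `a-y` and `xyz`.
fn intersection_example() -> String {
    let left = range('a', 'y');
    let right: BTreeSet<char> = "xyz".chars().collect();
    left.intersection(&right).collect()
}

/// Models `[0-9--4]`: the characters in `0-9` that are not `4`.
fn subtraction_example() -> String {
    let left = range('0', '9');
    let right: BTreeSet<char> = "4".chars().collect();
    left.difference(&right).collect()
}

/// Models `[a-g~~b-h]`: the characters in exactly one of `a-g` and `b-h`.
fn symmetric_difference_example() -> String {
    let left = range('a', 'g');
    let right = range('b', 'h');
    left.symmetric_difference(&right).collect()
}
```

The three functions return `"xy"`, `"012356789"` and `"ah"` respectively, which agrees with the descriptions in the table.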
314 Precedence in character classes, from most binding to least:
316 1. Ranges: `a-cd` == `[a-c]d`
318 3. Intersection: `^a-z&&b` == `^[a-z&&b]`
348 ^     the beginning of text (or start-of-line with multi-line mode)
349 $     the end of text (or end-of-line with multi-line mode)
350 \A    only the beginning of text (even with multi-line mode enabled)
351 \z    only the end of text (even with multi-line mode enabled)
356 The empty regex is valid and matches the empty string. For example, the empty
357 regex matches `abc` at positions `0`, `1`, `2` and `3`.
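
The positions listed for `abc` are the gaps before, between and after its characters. A std-only sketch that enumerates them follows; `empty_match_positions` is a hypothetical helper. It assumes that, for `&str` input, an empty match can only fall on a UTF-8 character boundary, which the text above states only for the ASCII case.

```rust
/// Byte offsets at which an empty match could occur in `text`: every UTF-8
/// character boundary, including the end of the string.
fn empty_match_positions(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(i, _)| i)
        // The position after the last character is also a candidate.
        .chain(std::iter::once(text.len()))
        .collect()
}
```

For `"abc"` this yields `[0, 1, 2, 3]`, the four positions named above.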
363 (?P<name>exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z.\[\]])
364 (?:exp)        non-capturing group
366 (?flags:exp)   set flags for exp (non-capturing)
370 and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at
371 the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets
377 i     case-insensitive: letters match both upper and lower case
378 m     multi-line mode: ^ and $ match begin/end of line
379 s     allow . to match \n
386 case-insensitively for the first part but case-sensitively for the second part:
389 # use regex::Regex;
391 let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
400 Multi-line mode means `^` and `$` no longer match just at the beginning/end of
404 # use regex::Regex;
405 let re = Regex::new(r"(?m)^line \d+").unwrap();
413 # use regex::Regex;
414 let re = Regex::new(r"(?m)^").unwrap();
423 # use regex::Regex;
425 let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
441 \123        octal character code (up to three digits) (when enabled)
443 \x{10FFFF}  any hex character code corresponding to a Unicode code point
445 \u{7F}      any hex character code corresponding to a Unicode code point
447 \U{7F}      any hex character code corresponding to a Unicode code point
467 [[:alnum:]]    alphanumeric ([0-9A-Za-z])
468 [[:alpha:]]    alphabetic ([A-Za-z])
469 [[:ascii:]]    ASCII ([\x00-\x7F])
471 [[:cntrl:]]    control ([\x00-\x1F\x7F])
472 [[:digit:]]    digits ([0-9])
473 [[:graph:]]    graphical ([!-~])
474 [[:lower:]]    lower case ([a-z])
475 [[:print:]]    printable ([ -~])
476 [[:punct:]]    punctuation ([!-/:-@\[-`{-~])
478 [[:upper:]]    upper case ([A-Z])
479 [[:word:]]     word characters ([0-9A-Za-z_])
480 [[:xdigit:]]   hex digit ([0-9A-Fa-f])
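
Several of the bracketed ranges in this table coincide with ASCII predicates in the standard library. The sketch below writes four of the classes out as range patterns and compares them with `char::is_ascii_*` over all ASCII values. The function names are hypothetical, and the comparison is against `std`, not against this crate's own tables.

```rust
/// `[[:punct:]]` written out as the ranges listed in the table.
fn is_punct_by_ranges(c: char) -> bool {
    matches!(c, '!'..='/' | ':'..='@' | '['..='`' | '{'..='~')
}

/// `[[:graph:]]` written out as `[!-~]`.
fn is_graph_by_ranges(c: char) -> bool {
    matches!(c, '!'..='~')
}

/// `[[:cntrl:]]` written out as `[\x00-\x1F\x7F]`.
fn is_cntrl_by_ranges(c: char) -> bool {
    matches!(c, '\x00'..='\x1F' | '\x7F')
}

/// `[[:xdigit:]]` written out as `[0-9A-Fa-f]`.
fn is_xdigit_by_ranges(c: char) -> bool {
    matches!(c, '0'..='9' | 'A'..='F' | 'a'..='f')
}

/// Returns true if the ranges above agree with the standard library's ASCII
/// predicates for every ASCII value.
fn ranges_agree_with_std() -> bool {
    (0u8..=0x7F).map(char::from).all(|c| {
        is_punct_by_ranges(c) == c.is_ascii_punctuation()
            && is_graph_by_ranges(c) == c.is_ascii_graphic()
            && is_cntrl_by_ranges(c) == c.is_ascii_control()
            && is_xdigit_by_ranges(c) == c.is_ascii_hexdigit()
    })
}
```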
485 By default, this crate tries pretty hard to make regex matching both as fast
487 is a lot of code dedicated to performance, the handling of Unicode data and the
488 Unicode data itself. Overall, this leads to more dependencies, larger binaries
491 is still left with a perfectly serviceable regex engine that will work well
499 `unicode-case` feature (described below), then compiling the regex `(?i)a`
501 callers must use `(?i-u)a` instead to disable Unicode case folding. Stated
510 * **std** -
511   When enabled, this will cause `regex` to use the standard library. Currently,
513   intended to add `alloc`-only support to regex in the future.
517 * **perf** -
521 * **perf-dfa** -
522   Enables the use of a lazy DFA for matching. The lazy DFA is used to compile
523   portions of a regex to a very fast DFA on an as-needed basis. This can
527 * **perf-inline** -
531 * **perf-literal** -
534   magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies.
535 * **perf-cache** -
536   This feature used to enable a faster internal cache at the cost of using
543 * **unicode** -
546 * **unicode-age** -
548   [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
549   This makes it possible to use classes like `\p{Age:6.0}` to refer to all
551 * **unicode-bool** -
555 * **unicode-case** -
558 * **unicode-gencat** -
560 …[Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Va…
561   This includes, but is not limited to, `Decimal_Number`, `Letter`,
563 * **unicode-perl** -
564   Provide the data for supporting the Unicode-aware Perl character classes,
565   corresponding to `\w`, `\s` and `\d`. This is also necessary for using
566   Unicode-aware word boundary assertions. Note that if this feature is
568   `unicode-bool` and `unicode-gencat` features are enabled, respectively.
569 * **unicode-script** -
572   This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
574 * **unicode-segment** -
575   Provide the data necessary to provide the properties used to implement the
589 Without this, it would be trivial for an attacker to exhaust your system's
593 crate have time complexity `O(mn)` (with `m ~ regex` and `n ~ search
594 text`), which means there's no way to cause exponential blow-up like with
596 features like arbitrary look-ahead and backreferences.)
598 When a DFA is used, pathological cases with exponential state blow-up are
601 our time complexity guarantees, but can lead to memory growth
602 proportional to the size of the input. As a stopgap, the DFA is only
603 allowed to store a fixed number of states. When the limit is reached, its
605 the limit is reached too frequently, it gives up and hands control off to
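
The bounded-cache behaviour described here can be pictured with a toy model. This is only a sketch under stated assumptions: `StateCache`, `CacheStatus` and both limits are hypothetical names, and "too frequently" is simplified to a fixed number of flushes. It is not how the crate's lazy DFA is actually implemented.

```rust
use std::collections::HashMap;

/// Outcome of asking the cache to remember one more state.
#[derive(Debug, PartialEq)]
enum CacheStatus {
    /// The state was stored, or was already present.
    Stored,
    /// The cache was full, so it was wiped before storing the new state.
    Flushed,
    /// Too many flushes: the caller should fall back to another engine.
    GaveUp,
}

/// Hypothetical bounded cache of DFA states.
struct StateCache {
    /// Maps a set of NFA states to the id of the DFA state built from it.
    states: HashMap<Vec<usize>, usize>,
    max_states: usize,
    flushes: usize,
    max_flushes: usize,
}

impl StateCache {
    fn new(max_states: usize, max_flushes: usize) -> StateCache {
        StateCache { states: HashMap::new(), max_states, flushes: 0, max_flushes }
    }

    fn insert(&mut self, nfa_states: Vec<usize>) -> CacheStatus {
        if self.states.contains_key(&nfa_states) {
            return CacheStatus::Stored;
        }
        let mut status = CacheStatus::Stored;
        if self.states.len() >= self.max_states {
            if self.flushes >= self.max_flushes {
                // Memory stays bounded: nothing new is stored.
                return CacheStatus::GaveUp;
            }
            // Wipe everything and carry on with an empty cache.
            self.states.clear();
            self.flushes += 1;
            status = CacheStatus::Flushed;
        }
        let id = self.states.len();
        self.states.insert(nfa_states, id);
        status
    }
}
```

In this model the cache never holds more than `max_states` entries, so memory use does not grow with the input. The cost of a flush is that wiped states have to be rebuilt if they are needed again.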
616 compile_error!("`std` feature is currently required to build this crate");
618 // To check README's example
619 // TODO: Re-enable this once the MSRV is 1.43 or greater.
620 // See: https://github.com/rust-lang/regex/issues/684
621 // See: https://github.com/rust-lang/regex/issues/685
636     Locations, Match, Matches, NoExpand, Regex, Replacer, ReplacerRef, Split,
643 This module provides a nearly identical API to the one found in the
644 top-level of this crate. There are two important differences:
649 matching invalid UTF-8 bytes.
653 This shows how to find all null-terminated strings in a slice of bytes:
656 # use regex::bytes::Regex;
657 let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
661 // The unwrap is OK here since a match requires the `cstr` capture to match.
671 This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
672 string (e.g., to extract a title from a Matroska file):
676 # use regex::bytes::Regex;
677 let re = Regex::new(
678     r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
683 // Notice that despite the `.*` at the end, it will only match valid UTF-8
689 // If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
695 is part of the overall match, then the capture is *guaranteed* to be valid
696 UTF-8.
704 1. The `u` flag can be disabled even when disabling it might cause the regex to
705 match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
710 revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
711 to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
712 4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
714 5. Hexadecimal notation can be used to specify arbitrary bytes instead of
717 matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation when
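
The `\xC3\xBF` encoding mentioned in item 5 can be confirmed with the standard library alone. In this sketch `utf8_bytes` is a hypothetical helper that returns the UTF-8 encoding of a single Unicode scalar value.

```rust
/// Returns the UTF-8 bytes that encode the Unicode scalar value `c`.
fn utf8_bytes(c: char) -> Vec<u8> {
    // Four bytes is the longest possible UTF-8 encoding of one scalar value.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

`utf8_bytes('\u{FF}')` gives the two bytes `0xC3, 0xBF`, whereas the single byte `0xFF` is rejected by `std::str::from_utf8`. That is why the raw byte can only be matched in ASCII compatible mode.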
724 In general, one should expect performance on `&[u8]` to be roughly similar to
737 #[cfg(feature = "perf-dfa")]
758 /// The `internal` module exists to support suspicious activity, such as
759 /// testing different matching engines and supporting the `regex-debug` CLI