UNICODE.md - OpenGrok cross reference for /third_party/rust/crates/regex/UNICODE.md

3 This document describes the regex crate's conformance to Unicode's
14    are ASCII-only definitions.
16 Little to no support is provided for either Level 2 or Level 3. For the most
17 part, this is because the features are either complex/hard to implement, or at
18 the very least, very difficult to implement without sacrificing performance.
21 undertaking. This is at least partially a result of the fact that this regex
30 Hex Notation refers to the ability to specify a Unicode code point in a regular
32 environments that have poor Unicode font rendering or if you need to express a
37     \x{10FFFF}  any hex character code corresponding to a Unicode code point
39     \u{7F}      any hex character code corresponding to a Unicode code point
41     \U{7F}      any hex character code corresponding to a Unicode code point
46 fixed-width variants of the same idea.
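As a rough illustration of what "a hex character code corresponding to a Unicode code point" means, the sketch below turns the hex digits inside a `\x{...}`-style escape into a Rust `char` using only the standard library. The helper name `code_point_from_hex` is hypothetical, and this is not how the regex crate parses escapes; it only shows which values are accepted as code points under Rust's `char` type.

```rust
/// Hypothetical helper: interpret the hex digits of a `\x{...}`-style escape.
/// Returns `None` if the digits are not valid hex, or if the value is not a
/// Unicode scalar value (above 0x10FFFF, or inside the surrogate range).
fn code_point_from_hex(digits: &str) -> Option<char> {
    let value = u32::from_str_radix(digits, 16).ok()?;
    char::from_u32(value)
}

fn main() {
    // The two values used in the notation examples above.
    assert_eq!(code_point_from_hex("7F"), Some('\u{7F}'));
    assert_eq!(code_point_from_hex("10FFFF"), Some('\u{10FFFF}'));
    // One past the largest code point is rejected.
    assert_eq!(code_point_from_hex("110000"), None);
    println!("ok");
}
```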
48 Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
50 mode is disabled. That is, the regex `\xFF` matches the Unicode code point
51 U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
60 provide a convenient way to construct character classes of groups of code
61 points specified by Unicode. The regex crate does not provide exhaustive
90 The following is a list of all properties supported by the regex crate (starred
91 properties correspond to properties required by RL1.2):
154 The regex crate only provides ASCII definitions of the
157 at all). This is because it seems to be consistent with most other regular
158 expression engines, and in particular, because these are often referred to as
163 `[[:word:]]` and `(?-u)\w` are equivalent.
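To make the ASCII-only definition concrete, here is a minimal sketch of an ASCII word-character test, assuming the usual definition of the class as `[0-9A-Za-z_]`. The function name `is_ascii_word` is made up for illustration and is not part of the regex crate's API.

```rust
/// Hypothetical helper: true for characters in the ASCII word class,
/// assumed here to be `[0-9A-Za-z_]`.
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

fn main() {
    assert!(is_ascii_word('a'));
    assert!(is_ascii_word('9'));
    assert!(is_ascii_word('_'));
    // Non-ASCII letters and punctuation fall outside the ASCII definition.
    assert!(!is_ascii_word('\u{E9}')); // U+00E9, e with acute accent
    assert!(!is_ascii_word('-'));
    println!("ok");
}
```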
170 The regex crate provides full support for nested character classes, along with
171 union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
174 For example, to match all non-ASCII letters, you could use either
175 `[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
183 The regex crate provides basic Unicode-aware word boundary assertions. A word
185 boundary negation corresponds to a zero-width match, where its adjacent
186 characters correspond to word and non-word, or non-word and word characters.
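The definition above can be sketched with the standard library alone: a position is a word boundary when exactly one of its two neighbouring characters is a word character. Everything named here (`is_word_boundary`, `ascii_word`, `approx_unicode_word`) is hypothetical, and `approx_unicode_word` relies on `char::is_alphanumeric`, which only approximates the crate's Unicode definition of a word character.

```rust
/// Hypothetical helper: is byte offset `at` a word boundary in `text`?
/// `at` must lie on a UTF-8 character boundary, otherwise slicing panics.
fn is_word_boundary(text: &str, at: usize, is_word: impl Fn(char) -> bool) -> bool {
    // A missing neighbour (start or end of text) counts as a non-word character.
    let before = text[..at].chars().next_back().map_or(false, &is_word);
    let after = text[at..].chars().next().map_or(false, &is_word);
    before != after
}

/// ASCII-only word characters, assumed to be `[0-9A-Za-z_]`.
fn ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Rough stand-in for a Unicode word character (an approximation only).
fn approx_unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

fn main() {
    let text = "ab cd";
    assert!(is_word_boundary(text, 0, ascii_word)); // start of "ab"
    assert!(!is_word_boundary(text, 1, ascii_word)); // between 'a' and 'b'
    assert!(is_word_boundary(text, 2, ascii_word)); // end of "ab"

    // U+00E9 is a word character only under the Unicode-style predicate.
    let accented = "\u{E9}t\u{E9}";
    assert!(!is_word_boundary(accented, 0, ascii_word));
    assert!(is_word_boundary(accented, 0, approx_unicode_word));
    println!("ok");
}
```

The negated assertion is simply the logical negation of this check at the same position.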
188 Conformance in this case chooses to define word character in the same way that
200 but is permissible according to
202 Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
206 boundaries to be used instead. That is, `\b` is a Unicode word boundary while
207 `(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
209 boundaries is currently sub-optimal on non-ASCII text.
216 The regex crate provides full support for case-insensitive matching in
219 "simple" mapping is a mapping from exactly one code point to exactly one other
221 example, straightforward to implement.
223 When case-insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`),
231 The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
233 convenience, and to avoid performance cliffs that Unicode word boundaries are
234 subject to.
236 Ideally, it would be nice to at least support `\r\n` as a line boundary as
244 The regex crate provides full support for Unicode code point matching. Namely,
247 Given Rust's strong ties to UTF-8, the following guarantees are also provided:
249 * All matches are reported on valid UTF-8 code unit boundaries. That is, any
250   match range returned by the public regex API is guaranteed to successfully
252 * As a consequence of the above, it is impossible to match surrogate code points.
253   No support for UTF-16 is provided, so this is never necessary.
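The two guarantees above mirror properties of Rust's own string types, which the following standard-library sketch illustrates. It does not call the regex crate; `is_sliceable` is a hypothetical helper that checks whether a byte range could be used to slice a `&str` without panicking, which is the property the first bullet promises for match ranges.

```rust
/// Hypothetical helper: does `start..end` fall on UTF-8 boundaries of `text`?
/// `str::get` returns `None` for out-of-range or mid-character offsets.
fn is_sliceable(text: &str, start: usize, end: usize) -> bool {
    text.get(start..end).is_some()
}

fn main() {
    // 'a' is one byte; U+00E9 is two bytes in UTF-8 (0xC3 0xA9).
    let text = "a\u{E9}";
    assert_eq!(text.len(), 3);
    assert!(is_sliceable(text, 0, 1));
    assert!(is_sliceable(text, 1, 3));
    // Offset 2 is in the middle of the two-byte sequence.
    assert!(!is_sliceable(text, 1, 2));

    // Surrogate code points are not valid `char` values, so a `&str`
    // can never contain one.
    assert!(char::from_u32(0xD800).is_none());
    assert!(char::from_u32(0xDFFF).is_none());
    assert!(char::from_u32(0xE000).is_some());
    println!("ok");
}
```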
257 Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
258 regex but `\pL(?-u)\xFF` (matches any Unicode `Letter` followed by the literal