Lines Matching full:unicode
1 # Unicode conformance
3 This document describes the regex crate's conformance to Unicode's
4 [UTS#18](https://unicode.org/reports/tr18/)
7 Full support for Level 1 ("Basic Unicode Support") is provided with two
10 1. Line boundaries are not Unicode aware. Namely, only the `\n`
13 [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
28 [UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)
30 Hex Notation refers to the ability to specify a Unicode code point in a regular
32 environments that have poor Unicode font rendering or if you need to express a
37 \x{10FFFF} any hex character code corresponding to a Unicode code point
39 \u{7F} any hex character code corresponding to a Unicode code point
41 \U{7F} any hex character code corresponding to a Unicode code point
48 Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
49 banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
50 mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint
57 [UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)
59 Full support for Unicode property syntax is provided. Unicode properties
61 points specified by Unicode. The regex crate does not provide exhaustive
64 * [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
65 * [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
66 * [Age](https://unicode.org/reports/tr18/#Age)
68 [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.
72 underscores. Property name aliases can be found in Unicode's
73 [`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
74 file, while property value aliases can be found in Unicode's
75 [`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
84 * `\p{age:3.2}` selects all code points in Unicode 3.2.
152 [UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
155 [compatibility properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibi…
161 Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
162 Their traditional ASCII definition can be used by disabling Unicode. That is,
168 [UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)
181 [UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
183 The regex crate provides basic Unicode aware word boundary assertions. A word
199 [prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
201 [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
205 Finally, Unicode word boundaries can be disabled, which will cause ASCII word
206 boundaries to be used instead. That is, `\b` is a Unicode word boundary while
208 if performance is important, since the implementation of Unicode word
214 [UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)
229 [UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)
233 convenience, and to avoid performance cliffs that Unicode word boundaries are
242 [UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)
244 The regex crate provides full support for Unicode code point matching. Namely,
255 Note that when Unicode mode is disabled, the fundamental atom of matching is
256 no longer a code point but a single byte. When Unicode mode is disabled, many
257 Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
258 regex but `\pL(?-u)\xFF` (matches any Unicode `Letter` followed by the literal