# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](https://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex/hard to implement, or at
the very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable.
All forms of hexadecimal notation
are supported:

    \x7F          hex character code (exactly two digits)
    \x{10FFFF}    any hex character code corresponding to a Unicode code point
    \u007F        hex character code (exactly four digits)
    \u{7F}        any hex character code corresponding to a Unicode code point
    \U0000007F    hex character code (exactly eight digits)
    \U{7F}        any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores.
Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions.
A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

For conformance purposes, a word character is defined in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`),
then all character classes are case folded as well.
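
The convenience of a one-to-one mapping is easiest to see by contrast with a
full mapping, where a single code point can expand to several. The sketch
below is only an illustration under stated assumptions: it uses the standard
library's `char::to_uppercase` (a full case *mapping*, not the simple case
*folding* table discussed above) as a stand-in, and `full_uppercase_len` is a
hypothetical helper name. It does not call the regex crate.

```rust
/// Hypothetical helper: counts the code points produced by the standard
/// library's full uppercase mapping of `c`. This is NOT the simple case
/// folding used for regex matching; it only shows that full mappings are
/// not always one code point to one code point.
fn full_uppercase_len(c: char) -> usize {
    c.to_uppercase().count()
}

fn main() {
    // Most characters map to exactly one code point.
    assert_eq!(full_uppercase_len('a'), 1);

    // U+00DF (LATIN SMALL LETTER SHARP S) expands to two code points under
    // the full mapping, which a one-to-one "simple" mapping never does.
    assert_eq!(full_uppercase_len('\u{DF}'), 2);
    assert_eq!('\u{DF}'.to_uppercase().to_string(), "SS");
}
```

Under a simple mapping, folding a character class only ever adds individual
code points, so the result is still an ordinary character class rather than a
set of multi-code-point strings.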

## RL1.6 Line Boundaries

[UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid performance cliffs that Unicode word boundaries are
subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code
  points. No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.
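
The boundary guarantee and the code point versus byte distinction can be made
concrete with the standard library alone. The following is a minimal sketch,
not part of the regex crate: `utf8_bytes` and `checked_slice` are hypothetical
helper names introduced here for illustration. It shows the UTF-8 encoding of
U+00FF, what "slicing on a valid boundary" means for a byte range, and why
surrogate code points cannot appear in a Rust string.

```rust
/// Hypothetical helper: returns the UTF-8 encoding of a single code point.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: returns `Some(slice)` only when `start..end` is in
/// bounds and both offsets fall on UTF-8 code point boundaries.
fn checked_slice(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}

fn main() {
    // U+00FF occupies two bytes in UTF-8, so a code point is not a byte.
    assert_eq!(utf8_bytes('\u{FF}'), vec![0xC3, 0xBF]);

    // "a", then U+00FF (bytes 1..3), then "b".
    let haystack = "a\u{FF}b";

    // A range that starts and ends on code point boundaries slices cleanly.
    assert_eq!(checked_slice(haystack, 1, 3), Some("\u{FF}"));

    // A range that ends in the middle of U+00FF is not a valid slice.
    assert_eq!(checked_slice(haystack, 1, 2), None);

    // Surrogates are not Unicode scalar values, so they are not valid `char`s.
    assert!(char::from_u32(0xD800).is_none());
}
```

A match range reported on a `&str` haystack corresponds to the first kind of
range above, which is why it can always be used to slice the searched string.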