# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](https://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex/hard to implement, or at
the very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F          hex character code (exactly two digits)
    \x{10FFFF}    any hex character code corresponding to a Unicode code point
    \u007F        hex character code (exactly four digits)
    \u{7F}        any hex character code corresponding to a Unicode code point
    \U0000007F    hex character code (exactly eight digits)
    \U{7F}        any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the braces. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define word character in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`),
then all character classes are case folded as well.


## RL1.6 Line Boundaries

[UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid performance cliffs that Unicode word boundaries are
subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* By consequence of the above, it is impossible to match surrogate code points.
  No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.