# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](https://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex to implement or, at the
very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the braces. In contrast, `\xNN`, `\uNNNN` and `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode code point
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not support every Unicode
property, but it covers a useful subset. In particular:

* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}` (the `\P` and `!=` negations cancel out). Similarly for
  `General_Category` (or `gc` for short) and `Script_Extensions` (or `scx`
  for short).
* `\p{age:3.2}` selects all code points assigned as of Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definitions can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define word character in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled, all character classes are case folded
as well (e.g., `(?i)[a]` is equivalent to `a|A`).


## RL1.6 Line Boundaries

[UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid the performance cliffs that Unicode word boundaries
are subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code
  points. No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.