# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](https://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex to implement or, at the
very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the braces. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define word character in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled, all character classes are case folded
as well. For example, `(?i)[a]` is equivalent to `a|A`.


## RL1.6 Line Boundaries

[UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid performance cliffs that Unicode word boundaries are
subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* By consequence of the above, it is impossible to match surrogate code points.
  No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.