regex/src/lib.rs

/*!
This crate provides a library for parsing, compiling, and executing regular
expressions. Its syntax is similar to Perl-style regular expressions, but lacks
a few features like look around and backreferences. In exchange, all searches
execute in linear time with respect to the size of the regular expression and
search text.

This crate's documentation provides some simple examples, describes
[Unicode support](#unicode) and exhaustively lists the
[supported syntax](#syntax).

For more specific details on the API for regular expressions, please see the
documentation for the [`Regex`](struct.Regex.html) type.

# Usage

This crate is [on crates.io](https://crates.io/crates/regex) and can be
used by adding `regex` to your dependencies in your project's `Cargo.toml`.

```toml
[dependencies]
regex = "1"
```

# Example: find a date

General use of regular expressions in this package involves compiling an
expression and then using it to search, split or replace text. For example,
to confirm that some text resembles a date:

```rust
use regex::Regex;
let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
assert!(re.is_match("2014-01-01"));
```

Notice the use of the `^` and `$` anchors. In this crate, every expression
is executed with an implicit `.*?` at the beginning and end, which allows
it to match anywhere in the text. Anchors can be used to ensure that the
full text matches an expression.

This example also demonstrates the utility of
[raw strings](https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals)
in Rust, which
are just like regular strings except they are prefixed with an `r` and do
not process any escape sequences. For example, `"\\d"` is the same
expression as `r"\d"`.

# Example: Avoid compiling the same regex in a loop

It is an anti-pattern to compile the same regular expression in a loop
since compilation is typically expensive. (It takes anywhere from a few
microseconds to a few **milliseconds** depending on the size of the
regex.) Not only is compilation itself expensive, but this also prevents
optimizations that reuse allocations internally to the matching engines.

In Rust, it can sometimes be a pain to pass regular expressions around if
they're used from inside a helper function. Instead, we recommend using the
[`lazy_static`](https://crates.io/crates/lazy_static) crate to ensure that
regular expressions are compiled exactly once.

For example:

```rust
use lazy_static::lazy_static;
use regex::Regex;

fn some_helper_function(text: &str) -> bool {
    lazy_static! {
        static ref RE: Regex = Regex::new("...").unwrap();
    }
    RE.is_match(text)
}

fn main() {}
```

Specifically, in this example, the regex will be compiled when it is used for
the first time. On subsequent uses, it will reuse the previous compilation.

# Example: iterating over capture groups

This crate provides convenient iterators for matching an expression
repeatedly against a search string to find successive non-overlapping
matches. For example, to find all dates in a string and be able to access
them by their component pieces:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
let text = "2012-03-14, 2013-01-01 and 2014-07-05";
for cap in re.captures_iter(text) {
    println!("Month: {} Day: {} Year: {}", &cap[2], &cap[3], &cap[1]);
}
// Output:
// Month: 03 Day: 14 Year: 2012
// Month: 01 Day: 01 Year: 2013
// Month: 07 Day: 05 Year: 2014
# }
```

Notice that the year is in the capture group indexed at `1`. This is
because the *entire match* is stored in the capture group at index `0`.

# Example: replacement with named capture groups

Building on the previous example, perhaps we'd like to rearrange the date
formats. This can be done with text replacement. But to make the code
clearer, we can *name* our capture groups and use those names as variables
in our replacement text:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
# }
```

The `replace` methods are actually polymorphic in the replacement, which
provides more flexibility than is seen here. (See the documentation for
`Regex::replace` for more details.)

Note that if your regex gets complicated, you can use the `x` flag to
enable insignificant whitespace mode, which also lets you write comments:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year
  -
  (?P<m>\d{2}) # the month
  -
  (?P<d>\d{2}) # the day
").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
# }
```

If you wish to match against whitespace in this mode, you can still use `\s`,
`\n`, `\t`, etc. For escaping a single space character, you can escape it
directly with `\ `, use its hex character code `\x20` or temporarily disable
the `x` flag, e.g., `(?-x: )`.

# Example: match multiple regular expressions simultaneously

This demonstrates how to use a `RegexSet` to match multiple (possibly
overlapping) regular expressions in a single scan of the search text:

```rust
use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches.
let matches: Vec<_> = set.matches("foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches("foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));
```

# Pay for what you use

With respect to searching text with a regular expression, there are three
questions that can be asked:

1. Does the text match this expression?
2. If so, where does it match?
3. Where did the capturing groups match?

Generally speaking, this crate could provide a function to answer only #3,
which would subsume #1 and #2 automatically. However, it can be significantly
more expensive to compute the location of capturing group matches, so it's best
not to do it if you don't need to.

Therefore, only use what you need. For example, don't use `find` if you
only need to test if an expression matches a string. (Use `is_match`
instead.)

# Unicode

This implementation executes regular expressions **only** on valid UTF-8
while exposing match locations as byte indices into the search string. (To
relax this restriction, use the [`bytes`](bytes/index.html) sub-module.)

Only simple case folding is supported. Namely, when matching
case-insensitively, the characters are first mapped using the "simple" case
folding rules defined by Unicode.

Regular expressions themselves are **only** interpreted as a sequence of
Unicode scalar values. This means you can use Unicode characters directly
in your expression:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?i)Δ+").unwrap();
let mat = re.find("ΔδΔ").unwrap();
assert_eq!((mat.start(), mat.end()), (0, 6));
# }
```

Most features of the regular expressions in this crate are Unicode aware. Here
are some examples:

* `.` will match any valid UTF-8 encoded Unicode scalar value except for `\n`.
  (To also match `\n`, enable the `s` flag, e.g., `(?s:.)`.)
* `\w`, `\d` and `\s` are Unicode aware. For example, `\s` will match all forms
  of whitespace categorized by Unicode.
* `\b` matches a Unicode word boundary.
* Negated character classes like `[^a]` match all Unicode scalar values except
  for `a`.
* `^` and `$` are **not** Unicode aware in multi-line mode. Namely, they only
  recognize `\n` and not any of the other forms of line terminators defined
  by Unicode.

Unicode general categories, scripts, script extensions, ages and a smattering
of boolean properties are available as character classes. For example, you can
match a sequence of numerals, Greek or Cherokee letters:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
let mat = re.find("abcΔᎠβⅠᏴγδⅡxyz").unwrap();
assert_eq!((mat.start(), mat.end()), (3, 23));
# }
```

For a more detailed breakdown of Unicode support with respect to
[UTS#18](https://unicode.org/reports/tr18/),
please see the
[UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
document in the root of the regex repository.

# Opt out of Unicode support

The `bytes` sub-module provides a `Regex` type that can be used to match
on `&[u8]`. By default, text is interpreted as UTF-8 just like it is with
the main `Regex` type. However, this behavior can be disabled by turning
off the `u` flag, even if doing so could result in matching invalid UTF-8.
For example, when the `u` flag is disabled, `.` will match any byte instead
of any Unicode scalar value.

Disabling the `u` flag is also possible with the standard `&str`-based `Regex`
type, but it is only allowed where the UTF-8 invariant is maintained. For
example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an
`&str`-based `Regex`, but `(?-u:\xFF)` will attempt to match the raw byte
`\xFF`, which is invalid UTF-8 and therefore is illegal in `&str`-based
regexes.

Finally, since Unicode support requires bundling large Unicode data
tables, this crate exposes knobs to disable the compilation of those
data tables, which can be useful for shrinking binary size and reducing
compilation times. For details on how to do that, see the section on [crate
features](#crate-features).

# Syntax

The syntax supported in this crate is documented below.

Note that the regular expression parser and abstract syntax are exposed in
a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax).

## Matching one character

<pre class="rust">
.             any character except new line (includes new line with s flag)
\d            digit (\p{Nd})
\D            not digit
\pN           One-letter name Unicode character class
\p{Greek}     Unicode character class (general category or script)
\PN           Negated one-letter name Unicode character class
\P{Greek}     negated Unicode character class (general category or script)
</pre>

### Character classes

<pre class="rust">
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
[\[\]]        Escaping in character classes (matching [ or ])
</pre>

Any named character class may appear inside a bracketed `[...]` character
class. For example, `[\p{Greek}[:digit:]]` matches any Greek or ASCII
digit. `[\p{Greek}&&\pL]` matches Greek letters.

Precedence in character classes, from most binding to least:

1. Ranges: `a-cd` == `[a-c]d`
2. Union: `ab&&bc` == `[ab]&&[bc]`
3. Intersection: `^a-z&&b` == `^[a-z&&b]`
4. Negation

## Composites

<pre class="rust">
xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)
</pre>

## Repetitions

<pre class="rust">
x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x
</pre>

## Empty matches

<pre class="rust">
^     the beginning of text (or start-of-line with multi-line mode)
$     the end of text (or end-of-line with multi-line mode)
\A    only the beginning of text (even with multi-line mode enabled)
\z    only the end of text (even with multi-line mode enabled)
\b    a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B    not a Unicode word boundary
</pre>

The empty regex is valid and matches the empty string. For example, the empty
regex matches `abc` at positions `0`, `1`, `2` and `3`.

## Grouping and flags

<pre class="rust">
(exp)          numbered capture group (indexed by opening parenthesis)
(?P&lt;name&gt;exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z.\[\]])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)
</pre>

Flags are each a single character. For example, `(?x)` sets the flag `x`
and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at
the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets
the `x` flag and clears the `y` flag.

All flags are by default disabled unless stated otherwise. They are:

<pre class="rust">
i     case-insensitive: letters match both upper and lower case
m     multi-line mode: ^ and $ match begin/end of line
s     allow . to match \n
U     swap the meaning of x* and x*?
u     Unicode support (enabled by default)
x     ignore whitespace and allow line comments (starting with `#`)
</pre>

Flags can be toggled within a pattern. Here's an example that matches
case-insensitively for the first part but case-sensitively for the second part:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
let cap = re.captures("AaAaAbbBBBb").unwrap();
assert_eq!(&cap[0], "AaAaAbb");
# }
```

Notice that the `a+` matches either `a` or `A`, but the `b+` only matches
`b`.

Multi-line mode means `^` and `$` no longer match just at the beginning/end of
the input, but at the beginning/end of lines:

```rust
# use regex::Regex;
let re = Regex::new(r"(?m)^line \d+").unwrap();
let m = re.find("line one\nline 2\n").unwrap();
assert_eq!(m.as_str(), "line 2");
```

Note that `^` matches after new lines, even at the end of input:

```rust
# use regex::Regex;
let re = Regex::new(r"(?m)^").unwrap();
let m = re.find_iter("test\n").last().unwrap();
assert_eq!((m.start(), m.end()), (5, 5));
```

Here is an example that uses an ASCII word boundary instead of a Unicode
word boundary:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
let cap = re.captures("$$abc$$").unwrap();
assert_eq!(&cap[0], "abc");
# }
```

## Escape sequences

<pre class="rust">
\*          literal *, works for any punctuation character: \.+*?()|[]{}^$
\a          bell (\x07)
\f          form feed (\x0C)
\t          horizontal tab
\n          new line
\r          carriage return
\v          vertical tab (\x0B)
\123        octal character code (up to three digits) (when enabled)
\x7F        hex character code (exactly two digits)
\x{10FFFF}  any hex character code corresponding to a Unicode code point
\u007F      hex character code (exactly four digits)
\u{7F}      any hex character code corresponding to a Unicode code point
\U0000007F  hex character code (exactly eight digits)
\U{7F}      any hex character code corresponding to a Unicode code point
</pre>

## Perl character classes (Unicode friendly)

These classes are based on the definitions provided in
[UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):

<pre class="rust">
\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character
</pre>

## ASCII character classes

<pre class="rust">
[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])
</pre>

c67d6573Sopenharmony_ci# Crate features
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ciBy default, this crate tries pretty hard to make regex matching both as fast
c67d6573Sopenharmony_cias possible and as correct as it can be, within reason. This means that there
c67d6573Sopenharmony_ciis a lot of code dedicated to performance, the handling of Unicode data and the
c67d6573Sopenharmony_ciUnicode data itself. Overall, this leads to more dependencies, larger binaries
c67d6573Sopenharmony_ciand longer compile times.  This trade off may not be appropriate in all cases,
c67d6573Sopenharmony_ciand indeed, even when all Unicode and performance features are disabled, one
c67d6573Sopenharmony_ciis still left with a perfectly serviceable regex engine that will work well
c67d6573Sopenharmony_ciin many cases.
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ciThis crate exposes a number of features for controlling that trade off. Some
c67d6573Sopenharmony_ciof these features are strictly performance oriented, such that disabling them
c67d6573Sopenharmony_ciwon't result in a loss of functionality, but may result in worse performance.
c67d6573Sopenharmony_ciOther features, such as the ones controlling the presence or absence of Unicode
c67d6573Sopenharmony_cidata, can result in a loss of functionality. For example, if one disables the
c67d6573Sopenharmony_ci`unicode-case` feature (described below), then compiling the regex `(?i)a`
c67d6573Sopenharmony_ciwill fail since Unicode case insensitivity is enabled by default. Instead,
c67d6573Sopenharmony_cicallers must use `(?i-u)a` instead to disable Unicode case folding. Stated
c67d6573Sopenharmony_cidifferently, enabling or disabling any of the features below can only add or
c67d6573Sopenharmony_cisubtract from the total set of valid regular expressions. Enabling or disabling
c67d6573Sopenharmony_cia feature will never modify the match semantics of a regular expression.
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ciAll features below are enabled by default.
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ci### Ecosystem features
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ci* **std** -
c67d6573Sopenharmony_ci  When enabled, this will cause `regex` to use the standard library. Currently,
c67d6573Sopenharmony_ci  disabling this feature will always result in a compilation error. It is
c67d6573Sopenharmony_ci  intended to add `alloc`-only support to regex in the future.
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ci### Performance features
c67d6573Sopenharmony_ci
c67d6573Sopenharmony_ci* **perf** -
c67d6573Sopenharmony_ci  Enables all performance related features. This feature is enabled by default
c67d6573Sopenharmony_ci  and will always cover all features that improve performance, even if more
c67d6573Sopenharmony_ci  are added in the future.
c67d6573Sopenharmony_ci* **perf-dfa** -
c67d6573Sopenharmony_ci  Enables the use of a lazy DFA for matching. The lazy DFA is used to compile
c67d6573Sopenharmony_ci  portions of a regex to a very fast DFA on an as-needed basis. This can
c67d6573Sopenharmony_ci  result in substantial speedups, usually by an order of magnitude on large
c67d6573Sopenharmony_ci  haystacks. The lazy DFA does not bring in any new dependencies, but it can
c67d6573Sopenharmony_ci  make compile times longer.
c67d6573Sopenharmony_ci* **perf-inline** -
c67d6573Sopenharmony_ci  Enables the use of aggressive inlining inside match routines. This reduces
c67d6573Sopenharmony_ci  the overhead of each match. The aggressive inlining, however, increases
c67d6573Sopenharmony_ci  compile times and binary size.
c67d6573Sopenharmony_ci* **perf-literal** -
c67d6573Sopenharmony_ci  Enables the use of literal optimizations for speeding up matches. In some
c67d6573Sopenharmony_ci  cases, literal optimizations can result in speedups of _several_ orders of
c67d6573Sopenharmony_ci  magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies.
c67d6573Sopenharmony_ci* **perf-cache** -
c67d6573Sopenharmony_ci  This feature used to enable a faster internal cache at the cost of using
c67d6573Sopenharmony_ci  additional dependencies, but this is no longer an option. A fast internal
c67d6573Sopenharmony_ci  cache is now used unconditionally with no additional dependencies. This may
c67d6573Sopenharmony_ci  change in the future.
c67d6573Sopenharmony_ci
### Unicode features

* **unicode** -
  Enables all Unicode features. This feature is enabled by default, and will
  always cover all Unicode features, even if more are added in the future.
* **unicode-age** -
  Provide the data for the
  [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
  This makes it possible to use classes like `\p{Age:6.0}` to refer to all
  codepoints first introduced in Unicode 6.0.
* **unicode-bool** -
  Provide the data for numerous Unicode boolean properties. The full list
  is not included here, but contains properties like `Alphabetic`, `Emoji`,
  `Lowercase`, `Math`, `Uppercase` and `White_Space`.
* **unicode-case** -
  Provide the data for case insensitive matching using
  [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
* **unicode-gencat** -
  Provide the data for
  [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
  This includes, but is not limited to, `Decimal_Number`, `Letter`,
  `Math_Symbol`, `Number` and `Punctuation`.
* **unicode-perl** -
  Provide the data for supporting the Unicode-aware Perl character classes,
  corresponding to `\w`, `\s` and `\d`. This is also necessary for using
  Unicode-aware word boundary assertions. Note that if this feature is
  disabled, the `\s` and `\d` character classes are still available if the
  `unicode-bool` and `unicode-gencat` features are enabled, respectively.
* **unicode-script** -
  Provide the data for
  [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/).
  This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
  `Latin` and `Thai`.
* **unicode-segment** -
  Provide the data for the properties used to implement the
  [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/).
  This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and
  `\p{sb=ATerm}`.

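As a rough illustration of how the features listed above combine, the
`Cargo.toml` fragment below disables the default feature set and opts back in
to a few features by name. The particular selection is only an assumption made
for the example, not a recommendation; `std` is listed because it is currently
required to build this crate.

```toml
[dependencies.regex]
version = "1"
# Turn off the default `perf` and `unicode` umbrella features.
default-features = false
# Re-enable only what is needed. This selection is illustrative:
# `std` is required, `perf-literal` keeps the literal optimizations, and
# `unicode-perl` keeps the Unicode-aware `\w`, `\s` and `\d` classes.
features = ["std", "perf-literal", "unicode-perl"]
```

With a selection like this, a pattern that relies on data from a disabled
feature (for example `\p{Greek}`, which needs `unicode-script`) is expected to
fail to compile, so the result of `Regex::new` should be checked rather than
unwrapped.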

# Untrusted input

This crate can handle both untrusted regular expressions and untrusted
search text.

Untrusted regular expressions are handled by capping the size of a compiled
regular expression.
(See [`RegexBuilder::size_limit`](struct.RegexBuilder.html#method.size_limit).)
Without this, it would be trivial for an attacker to exhaust your system's
memory with expressions like `a{100}{100}{100}`.

Untrusted search text is allowed because the matching engine(s) in this
crate have time complexity `O(mn)` (with `m ~ regex` and `n ~ search
text`), which means there's no way to cause exponential blow-up like with
some other regular expression engines. (We pay for this by disallowing
features like arbitrary look-ahead and backreferences.)

When a DFA is used, pathological cases with exponential state blow-up are
avoided by constructing the DFA lazily or in an "online" manner. Therefore,
at most one new state can be created for each byte of input. This satisfies
our time complexity guarantees, but can lead to memory growth
proportional to the size of the input. As a stopgap, the DFA is only
allowed to store a fixed number of states. When the limit is reached, its
states are wiped and the DFA continues on, possibly duplicating previous
work. If the limit is reached too frequently, it gives up and hands control
off to another matching engine with fixed memory requirements.
(The DFA size limit can also be tweaked. See
[`RegexBuilder::dfa_size_limit`](struct.RegexBuilder.html#method.dfa_size_limit).)
*/

#![deny(missing_docs)]
#![cfg_attr(feature = "pattern", feature(pattern))]
#![warn(missing_debug_implementations)]
#![allow(clippy::if_same_then_else)]
#[cfg(not(feature = "std"))]
compile_error!("`std` feature is currently required to build this crate");

// To check README's example
// TODO: Re-enable this once the MSRV is 1.43 or greater.
// See: https://github.com/rust-lang/regex/issues/684
// See: https://github.com/rust-lang/regex/issues/685
// #[cfg(doctest)]
// doc_comment::doctest!("../README.md");

#[cfg(feature = "std")]
pub use crate::error::Error;
#[cfg(feature = "std")]
pub use crate::re_builder::set_unicode::*;
#[cfg(feature = "std")]
pub use crate::re_builder::unicode::*;
#[cfg(feature = "std")]
pub use crate::re_set::unicode::*;
#[cfg(feature = "std")]
pub use crate::re_unicode::{
    escape, CaptureLocations, CaptureMatches, CaptureNames, Captures,
    Locations, Match, Matches, NoExpand, Regex, Replacer, ReplacerRef, Split,
    SplitN, SubCaptureMatches,
};
/**
Match regular expressions on arbitrary bytes.

This module provides a nearly identical API to the one found in the
top-level of this crate. There are two important differences:

1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
is used where `String` would have been used.
2. Unicode support can be disabled even when disabling it would result in
matching invalid UTF-8 bytes.

# Example: match null terminated string

This shows how to find all null-terminated strings in a slice of bytes:

```rust
# use regex::bytes::Regex;
let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
```

# Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
string (e.g., to extract a title from a Matroska file):

```rust
# use std::str;
# use regex::bytes::Regex;
let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
let caps = re.captures(text).unwrap();

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes.
let mat = caps.get(1).unwrap();
assert_eq!((7, 10), (mat.start(), mat.end()));

// If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
let title = str::from_utf8(&caps[1]).unwrap();
assert_eq!("☃", title);
```

In general, if the Unicode flag is enabled in a capture group and that capture
is part of the overall match, then the capture is *guaranteed* to be valid
UTF-8.
# Syntax

The supported syntax is pretty much the same as the syntax for Unicode
regular expressions with a few changes that make sense for matching arbitrary
bytes:

1. The `u` flag can be disabled even when disabling it might cause the regex to
match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
"ASCII compatible" mode.
2. In ASCII compatible mode, neither Unicode scalar values nor Unicode
character classes are allowed.
3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
determine whether a byte is a word byte or not.
5. Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint that
matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation when
enabled.
6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
`s` flag is additionally enabled, `.` matches any byte.

# Performance

In general, one should expect performance on `&[u8]` to be roughly similar to
performance on `&str`.
*/
#[cfg(feature = "std")]
pub mod bytes {
    pub use crate::re_builder::bytes::*;
    pub use crate::re_builder::set_bytes::*;
    pub use crate::re_bytes::*;
    pub use crate::re_set::bytes::*;
}

mod backtrack;
mod compile;
#[cfg(feature = "perf-dfa")]
mod dfa;
mod error;
mod exec;
mod expand;
mod find_byte;
mod input;
mod literal;
#[cfg(feature = "pattern")]
mod pattern;
mod pikevm;
mod pool;
mod prog;
mod re_builder;
mod re_bytes;
mod re_set;
mod re_trait;
mod re_unicode;
mod sparse;
mod utf8;

/// The `internal` module exists to support suspicious activity, such as
/// testing different matching engines and supporting the `regex-debug` CLI
/// utility.
#[doc(hidden)]
#[cfg(feature = "std")]
pub mod internal {
    pub use crate::compile::Compiler;
    pub use crate::exec::{Exec, ExecBuilder};
    pub use crate::input::{Char, CharInput, Input, InputAt};
    pub use crate::literal::LiteralSearcher;
    pub use crate::prog::{EmptyLook, Inst, InstRanges, Program};
}