regex-syntax
============
This crate provides a robust regular expression parser.

[![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions)
[![Crates.io](https://img.shields.io/crates/v/regex-syntax.svg)](https://crates.io/crates/regex-syntax)
[![Rust](https://img.shields.io/badge/rust-1.28.0%2B-blue.svg?maxAge=3600)](https://github.com/rust-lang/regex)


### Documentation

https://docs.rs/regex-syntax


### Overview

There are two primary types exported by this crate: `Ast` and `Hir`. The former
is a faithful abstract syntax of a regular expression, and can convert regular
expressions back to their concrete syntax while mostly preserving their original
form. The latter is a high-level intermediate representation of a regular
expression that is amenable to analysis and compilation into bytecode or
automata. An `Hir` achieves this by drastically simplifying the syntactic
structure of the regular expression. While an `Hir` can be converted back to
an equivalent concrete syntax, the result is unlikely to resemble the original
concrete syntax that produced the `Hir`.


### Example

This example shows how to parse a pattern string into its HIR:

```rust
use regex_syntax::Parser;
use regex_syntax::hir::{self, Hir};

// Parse the concrete syntax directly into an `Hir`.
let hir = Parser::new().parse("a|b").unwrap();
// The result is an alternation of two Unicode literals.
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
```


### Safety

This crate has no `unsafe` code and sets `forbid(unsafe_code)`. While it's
possible this crate could use `unsafe` code in the future, the standard
for doing so is extremely high. In general, most code in this crate is not
performance critical, since its running time tends to be dwarfed by the time it
takes to compile a regular expression into an automaton. There is therefore
little need for extreme optimization, and consequently little need for `unsafe`.

The standard for using `unsafe` in this crate is extremely high because this
crate is intended to be reasonably safe to use with user-supplied regular
expressions. While there may be bugs in the regex parser itself, they should
_never_ result in memory unsafety unless there is a bug in either the compiler
or the standard library, since `regex-syntax` has zero dependencies.


### Crate features

By default, this crate bundles a fairly large amount of Unicode data tables
(a source size of ~750KB). Because of their large size, one can disable some
or all of these data tables. If a regular expression attempts to use Unicode
data that is not available, then an error will occur when translating the `Ast`
to the `Hir`.

The full set of features one can disable is listed
[in the "Crate features" section of the documentation](https://docs.rs/regex-syntax/*/#crate-features).


### Testing

Simply running `cargo test` will give you very good coverage. However, because
of the large number of features exposed by this crate, a `test` script is
included in this directory which will test several feature combinations. This
is the same script that is run in CI.


### Motivation

The primary purpose of this crate is to provide the parser used by `regex`.
Specifically, this crate is treated as an implementation detail of `regex`,
and is primarily developed for the needs of `regex`.

Since this crate is an implementation detail of `regex`, it may experience
breaking change releases at a different cadence from `regex`. This is only
possible because this crate is _not_ a public dependency of `regex`.

Another consequence of this decoupling is that there is no direct way to
compile a `regex::Regex` from a `regex_syntax::hir::Hir`. Instead, one must
first convert the `Hir` to a string (via its `std::fmt::Display`
implementation) and then compile that via `Regex::new`. While this does repeat
some work, compilation typically takes much longer than parsing.

Stated differently, the coupling between `regex` and `regex-syntax` exists only
at the level of the concrete syntax.