C API for RUst's REgex engine
=============================
rure is a C API to Rust's regex library, which guarantees linear time
searching using finite automata. In exchange, it must give up some common
regex features such as backreferences and arbitrary lookaround. It does,
however, include capturing groups, lazy matching, Unicode support and word
boundary assertions. Its matching semantics generally correspond to Perl's,
or "leftmost first." Namely, the match locations reported correspond to the
first match that would be found by a backtracking engine.

The header file (`include/rure.h`) serves as the primary API documentation of
this library. Types and flags are documented first, and functions follow.

The regex syntax, along with other useful details, is documented in the Rust
API documentation: https://docs.rs/regex


Examples
--------
There are readable examples in the `ctest` and `examples` sub-directories.

Assuming you have
[Rust and Cargo installed](https://www.rust-lang.org/downloads.html)
(and a C compiler), then this should work to run the `iter` example:

```
$ git clone https://github.com/rust-lang/regex
$ cd regex/regex-capi/examples
$ ./compile
$ LD_LIBRARY_PATH=../target/release ./iter
```


Performance
-----------
It's fast. Its core matching engine is a lazy DFA, which is what GNU grep
and RE2 use. Like GNU grep, this regex engine can detect multi-byte literals
in the regex and will use fast literal string searching to quickly skip
through the input to find possible match locations.

All memory usage is bounded and all searching takes linear time with respect
to the length of the input string.

For more details, see the PERFORMANCE guide:
https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md


Text encoding
-------------
All regular expressions must be valid UTF-8.

The text encoding of haystacks is more complicated. To a first
approximation, haystacks should be UTF-8. In fact, UTF-8 (and, one
supposes, ASCII) is the only well-defined text encoding supported by this
library. It is impossible to match UTF-16, UTF-32 or any other encoding
without first transcoding it to UTF-8.

With that said, haystacks do not need to be valid UTF-8, and if they aren't
valid UTF-8, no performance penalty is paid. Whether invalid UTF-8 is
matched or not depends on the regular expression. For example, with the
`RURE_FLAG_UNICODE` flag enabled, the regex `.` is guaranteed to match a
single UTF-8 encoding of a Unicode codepoint (sans LF). In particular,
it will not match invalid UTF-8 such as `\xFF`, nor will it match surrogate
codepoints or "alternate" (i.e., non-minimal) encodings of codepoints.
However, with the `RURE_FLAG_UNICODE` flag disabled, the regex `.` will match
any *single* arbitrary byte (sans LF), including `\xFF`.

This provides a useful invariant: wherever `RURE_FLAG_UNICODE` is set, the
corresponding regex is guaranteed to match valid UTF-8. Invalid UTF-8 will
always prevent a match from happening when the flag is set. Since flags can be
toggled in the regular expression itself, this allows one to pick and choose
which parts of the regular expression must match valid UTF-8 and which need not.
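
To make concrete what "valid UTF-8" excludes in the paragraphs above, here is
a minimal, standalone sketch of a UTF-8 validity check in C (C99). It does not
use or reproduce anything from librure: the function name `is_valid_utf8` is
hypothetical, and the rules it encodes (no stray bytes such as `\xFF`, no
surrogate codepoints, no non-minimal encodings, nothing above U+10FFFF) are
the standard UTF-8 well-formedness rules, not code taken from the regex engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of rure.h: returns true if the `len` bytes
 * at `s` are well-formed UTF-8. */
bool is_valid_utf8(const uint8_t *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t extra;   /* number of continuation bytes expected */
        uint32_t cp;    /* codepoint decoded so far */
        uint32_t min;   /* smallest codepoint allowed for this length */

        if (b < 0x80) { i++; continue; }  /* ASCII */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte, or 0xF8..0xFF */

        if (len - i - 1 < extra) return false;  /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            uint8_t c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      /* non-minimal */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* out of range */
        i += extra + 1;
    }
    return true;
}
```

Under these rules the one-byte input `\xFF`, the non-minimal encoding
`\xC0\x80` and the encoded surrogate `\xED\xA0\x80` are all rejected, while
`\xE2\x82\xAC` (U+20AC) is accepted.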
 
Some good advice is to always enable the `RURE_FLAG_UNICODE` flag (which is
enabled when using `rure_compile_must`) and selectively disable the flag when
one wants to match arbitrary bytes. The flag can be disabled in a regular
expression with `(?-u)`.

Finally, if you want to match specific invalid UTF-8 bytes, you can use
escape sequences in the regex itself. For example, the C string literal
`"(?-u)\\xFF"` yields the regex `(?-u)\xFF`, which matches the byte `\xFF`.
It's not possible to use C literal escape sequences (such as `"\xFF"`) in
this case since regular expressions must be valid UTF-8.
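
The difference between the two spellings can be checked without librure,
since it is purely a matter of how a C compiler turns string literals into
bytes. In this small sketch (the helper `is_ascii_only` and both constant
names are hypothetical), the escaped form is nine ASCII bytes and therefore
valid UTF-8, so it is acceptable as a pattern. The C escape instead produces
the single raw byte 0xFF, which is not valid UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* What the C compiler produces for each spelling. */
const char ESCAPED_PATTERN[] = "(?-u)\\xFF"; /* regex text: (?-u)\xFF */
const char RAW_BYTE_PATTERN[] = "\xFF";      /* one raw byte: 0xFF    */

/* Hypothetical helper: true if every byte of `s` is ASCII, in which case
 * the string is trivially valid UTF-8. */
bool is_ascii_only(const char *s) {
    assert(s != NULL);
    for (size_t i = 0, n = strlen(s); i < n; i++) {
        if ((unsigned char)s[i] >= 0x80) return false;
    }
    return true;
}
```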
 
 
Aborts
------
This library will abort your process if an unwinding panic is caught in the
Rust code. Generally, a panic occurs when there is a bug in the program or
when allocation fails. It is possible to cause this behavior by passing
invalid inputs to some functions. For example, giving an invalid capture
group index to `rure_captures_at` will cause Rust's bounds checks to fail,
which will cause a panic, which will be caught and printed to stderr. The
process will then `abort`.


Missing
-------
There are a few things missing from the C API that are present in the Rust API.
There's no particular (known) reason why they are missing; they just haven't
been implemented yet.

* Splitting a string by a regex.
* Replacing regex matches in a string with some other text.
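
Until these exist, splitting can be layered on top of any "find the next
match" primitive. The sketch below is an illustration under stated
assumptions, not part of the C API: `split_by_matches`, `find_fn`, `span` and
`find_comma` are all hypothetical names, and the matcher is passed in as a
callback so that the example compiles without librure. With rure, the
callback would presumably wrap `rure_find`, which takes a start offset and
reports the start and end of the next match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, end) into the haystack. */
typedef struct { size_t start; size_t end; } span;

/* Hypothetical matcher callback: search hay[0..len) beginning at `from` and
 * store the leftmost match in *m; return false if there is no match. */
typedef bool (*find_fn)(void *ctx, const uint8_t *hay, size_t len,
                        size_t from, span *m);

/* Writes up to `cap` pieces (the text between matches) into `out` and
 * returns how many pieces were written. */
size_t split_by_matches(find_fn find, void *ctx, const uint8_t *hay,
                        size_t len, span *out, size_t cap) {
    assert(find != NULL && out != NULL);
    size_t n = 0, piece_start = 0, from = 0;
    span m;
    while (n < cap && from <= len && find(ctx, hay, len, from, &m)) {
        out[n].start = piece_start;
        out[n].end = m.start;
        n++;
        piece_start = m.end;
        /* Step past empty matches so the loop always makes progress. */
        from = (m.end > m.start) ? m.end : m.end + 1;
    }
    if (n < cap) {  /* trailing piece after the last match */
        out[n].start = piece_start;
        out[n].end = len;
        n++;
    }
    return n;
}

/* Stand-in matcher for demonstration only: finds the next ',' byte. */
bool find_comma(void *ctx, const uint8_t *hay, size_t len,
                size_t from, span *m) {
    (void)ctx;
    for (size_t i = from; i < len; i++) {
        if (hay[i] == ',') { m->start = i; m->end = i + 1; return true; }
    }
    return false;
}
```

The pieces are returned as offsets rather than copied strings, which leaves
allocation decisions to the caller. Replacement could follow the same loop,
copying each piece and then the replacement text into an output buffer.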