C API for RUst's REgex engine
=============================
rure is a C API to Rust's regex library, which guarantees linear time
searching using finite automata. In exchange, it must give up some common
regex features such as backreferences and arbitrary lookaround. It does,
however, include capturing groups, lazy matching, Unicode support and word
boundary assertions. Its matching semantics generally correspond to Perl's,
or "leftmost first." Namely, the match locations reported correspond to the
first match that would be found by a backtracking engine.

The header file (`include/rure.h`) serves as the primary API documentation of
this library. Types and flags are documented first, and functions follow.

The syntax and possibly other useful things are documented in the Rust
API documentation: https://docs.rs/regex


Examples
--------
There are readable examples in the `ctest` and `examples` sub-directories.

Assuming you have
[Rust and Cargo installed](https://www.rust-lang.org/downloads.html)
(and a C compiler), then this should work to run the `iter` example:

```
$ git clone https://github.com/rust-lang/regex
$ cd regex/regex-capi/examples
$ ./compile
$ LD_LIBRARY_PATH=../target/release ./iter
```


Performance
-----------
It's fast. Its core matching engine is a lazy DFA, which is what GNU grep
and RE2 use. Like GNU grep, this regex engine can detect multi-byte literals
in the regex and will use fast literal string searching to quickly skip
through the input to find possible match locations.

All memory usage is bounded and all searching takes linear time with respect
to the input string.

For more details, see the PERFORMANCE guide:
https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md


Text encoding
-------------
All regular expressions must be valid UTF-8.

The text encoding of haystacks is more complicated. To a first
approximation, haystacks should be UTF-8. In fact, UTF-8 (and, one
supposes, ASCII) is the only well defined text encoding supported by this
library.
It is impossible to match UTF-16, UTF-32 or any other encoding
without first transcoding it to UTF-8.

With that said, haystacks do not need to be valid UTF-8, and if they aren't
valid UTF-8, no performance penalty is paid. Whether invalid UTF-8 is
matched or not depends on the regular expression. For example, with the
`RURE_FLAG_UNICODE` flag enabled, the regex `.` is guaranteed to match a
single UTF-8 encoding of a Unicode codepoint (sans LF). In particular,
it will not match invalid UTF-8 such as `\xFF`, nor will it match surrogate
codepoints or "alternate" (i.e., non-minimal) encodings of codepoints.
However, with the `RURE_FLAG_UNICODE` flag disabled, the regex `.` will match
any *single* arbitrary byte (sans LF), including `\xFF`.

This provides a useful invariant: wherever `RURE_FLAG_UNICODE` is set, the
corresponding regex is guaranteed to match valid UTF-8. Invalid UTF-8 will
always prevent a match from happening when the flag is set. Since flags can be
toggled in the regular expression itself, this allows one to pick and choose
which parts of the regular expression must match UTF-8 or not.

Some good advice is to always enable the `RURE_FLAG_UNICODE` flag (which is
enabled when using `rure_compile_must`) and selectively disable the flag when
one wants to match arbitrary bytes. The flag can be disabled in a regular
expression with `(?-u)`.
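
To make the validity rules above concrete (no stray bytes such as `\xFF`, no
surrogate codepoints, no non-minimal encodings), here is a minimal sketch of a
checker for a single UTF-8 sequence. It is only an illustration of the rule
and is not how this library is implemented; `utf8_valid_len` is a hypothetical
helper name and is not part of `rure.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (NOT part of rure): returns the length in bytes (1-4)
 * of the valid UTF-8 sequence starting at s, or 0 if the first bytes of s
 * are not a valid, minimal encoding of a Unicode scalar value. */
size_t utf8_valid_len(const uint8_t *s, size_t n)
{
    uint8_t b0;

    assert(s != NULL);
    if (n == 0) return 0;
    b0 = s[0];

    if (b0 < 0x80) return 1;                        /* ASCII */

    if (b0 >= 0xC2 && b0 <= 0xDF) {                 /* 2 bytes; 0xC0/0xC1 would be non-minimal */
        if (n < 2 || (s[1] & 0xC0) != 0x80) return 0;
        return 2;
    }

    if (b0 >= 0xE0 && b0 <= 0xEF) {                 /* 3 bytes */
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        if (b0 == 0xE0 && s[1] < 0xA0) return 0;    /* non-minimal encoding */
        if (b0 == 0xED && s[1] >= 0xA0) return 0;   /* surrogate U+D800..U+DFFF */
        return 3;
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) {                 /* 4 bytes */
        if (n < 4 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
                  || (s[3] & 0xC0) != 0x80) return 0;
        if (b0 == 0xF0 && s[1] < 0x90) return 0;    /* non-minimal encoding */
        if (b0 == 0xF4 && s[1] >= 0x90) return 0;   /* above U+10FFFF */
        return 4;
    }

    return 0; /* 0x80..0xC1 and 0xF5..0xFF never start a valid sequence */
}
```

Under this sketch, the byte `\xFF`, the non-minimal sequence `\xC0\x80` and the
encoded surrogate `\xED\xA0\x80` are all rejected. These are the inputs that
`.` refuses to match while `RURE_FLAG_UNICODE` is enabled, but that `(?-u).`
would consume one byte at a time.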

Finally, to match specific invalid UTF-8 bytes, use escape sequences in the
regex itself. For example, `(?-u)\\xFF` (as written in a C string literal)
will match the byte `\xFF`. It's not possible to use C literal escape
sequences in this case since regular expressions must be valid UTF-8.


Aborts
------
This library will abort your process if an unwinding panic is caught in the
Rust code. Generally, a panic occurs when there is a bug in the program or
if allocation fails. It is possible to cause this behavior by passing
invalid inputs to some functions. For example, giving an invalid capture
group index to `rure_captures_at` will cause Rust's bounds checks to fail,
which will cause a panic, which will be caught and printed to stderr. The
process will then `abort`.


Missing
-------
There are a few things missing from the C API that are present in the Rust API.
There's no particular (known) reason why they are missing; they just haven't
been implemented yet.

* Splitting a string by a regex.
* Replacing regex matches in a string with some other text.