Your friendly guide to hacking and navigating the regex library.

This guide assumes familiarity with Rust and Cargo, and at least a perusal of
the user-facing documentation for this crate.

If you're looking for background on the implementation in this library, then
you can do no better than Russ Cox's article series on implementing regular
expressions using finite automata: https://swtch.com/~rsc/regexp/


## Architecture overview

As you probably already know, this library executes regular expressions using
finite automata. In particular, a design goal is to make searching linear
with respect to both the regular expression and the text being searched.
Meeting that design goal on its own is not so hard and can be done with an
implementation of the Pike VM (similar to Thompson's construction, but with
support for capturing groups), as described in:
https://swtch.com/~rsc/regexp/regexp2.html
This library contains such an implementation in src/pikevm.rs.

Making it fast is harder. One of the key problems with the Pike VM is that it
can be in more than one state at any point in time, and must shuffle capture
positions between them. The Pike VM also spends a lot of time following the
same epsilon transitions over and over again. We can employ one trick to
speed up the Pike VM: extract one or more literal prefixes from the regular
expression and execute specialized code to quickly find matches of those
prefixes in the search text. The Pike VM can then be avoided for most of the
search, and instead only executed when a prefix is found. The code to find
prefixes is in the regex-syntax crate (in this repository). The code to search
for literals is in src/literals.rs. When more than one literal prefix is found,
we fall back to an Aho-Corasick DFA using the aho-corasick crate. For one
literal, we use a variant of the Boyer-Moore algorithm. Both Aho-Corasick and
Boyer-Moore use `memchr` when appropriate. The Boyer-Moore variant in this
library also uses elementary frequency analysis to choose the right byte to run
`memchr` with.
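To make the prefix trick concrete, here is a minimal sketch using only the
standard library. It is not the code in src/literals.rs: `rank`, `find_byte`,
`find_literal` and `search_with_prefix` are hypothetical names, the rank table
is made up for illustration, and `find_byte` merely stands in for `memchr`.
The idea is to scan for the byte of the literal that is assumed to be rarest,
verify the full literal around each hit, and only then run the expensive
matcher.

```rust
/// Hypothetical rank table: a lower rank means the byte is assumed to be
/// rarer. A real table would come from measured byte frequencies.
fn rank(b: u8) -> u8 {
    match b {
        b' ' | b'e' | b't' | b'o' => 255,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' | b'0'..=b'9' => 100,
        _ => 10,
    }
}

/// Stand-in for `memchr`: offset of the first `needle` byte in `haystack`.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Find the first occurrence of `literal` by scanning for its rarest byte
/// and verifying the whole literal around each candidate.
fn find_literal(literal: &[u8], haystack: &[u8]) -> Option<usize> {
    if literal.is_empty() {
        return Some(0);
    }
    // Offset (within the literal) of the byte assumed to be rarest.
    let rare_at = (0..literal.len()).min_by_key(|&i| rank(literal[i]))?;
    let rare = literal[rare_at];
    // In any match, the rare byte cannot occur before offset `rare_at`.
    let mut pos = rare_at;
    while pos < haystack.len() {
        let i = pos + find_byte(rare, &haystack[pos..])?;
        let start = i - rare_at;
        if haystack[start..].starts_with(literal) {
            return Some(start);
        }
        pos = i + 1;
    }
    None
}

/// Run the expensive `full_match` check only where the literal prefix occurs.
/// `full_match` receives the haystack starting at a candidate offset and
/// returns the length of an anchored match, if there is one.
fn search_with_prefix(
    prefix: &[u8],
    haystack: &[u8],
    full_match: impl Fn(&[u8]) -> Option<usize>,
) -> Option<(usize, usize)> {
    let mut at = 0;
    while at <= haystack.len() {
        let start = at + find_literal(prefix, &haystack[at..])?;
        if let Some(len) = full_match(&haystack[start..]) {
            return Some((start, start + len));
        }
        at = start + 1;
    }
    None
}
```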

Of course, detecting prefix literals can only take us so far. Not all regular
expressions have literal prefixes. To remedy this, we try another approach
to executing the Pike VM: backtracking, whose implementation can be found in
src/backtrack.rs. One reason why backtracking can be faster is that it avoids
excessive shuffling of capture groups. Backtracking is, however, susceptible
to exponential runtimes, so we keep track of every state we've visited to make
sure we never visit it again. This guarantees linear time execution, but we
pay for it with the memory required to track visited states. Because of the
memory requirement, we only use this engine on small search strings *and* small
regular expressions.
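The visited-state idea can be sketched in a few lines. This is an
illustration under simplifying assumptions, not the code in src/backtrack.rs:
the `Inst` enum is a cut-down, hypothetical instruction set without captures,
and the search is anchored at the start of the input. The `visited` table has
one entry per (instruction, input offset) pair, which is exactly why the
memory cost grows with both the program and the search text.

```rust
/// Hypothetical, cut-down instruction set (no captures).
#[derive(Clone, Copy)]
enum Inst {
    /// Match one byte, then continue at the given instruction.
    Byte(u8, usize),
    /// Try the first branch, and the second one if the first fails.
    Split(usize, usize),
    Match,
}

/// Returns true if the program matches a prefix of `input`.
fn backtrack(prog: &[Inst], input: &[u8]) -> bool {
    let width = input.len() + 1;
    // One flag per (instruction, offset) pair: this is the memory we pay.
    let mut visited = vec![false; prog.len() * width];
    // Explicit stack of (instruction, offset) pairs still to explore.
    let mut stack = vec![(0usize, 0usize)];
    while let Some((pc, at)) = stack.pop() {
        let slot = pc * width + at;
        if visited[slot] {
            // Already explored from here; doing it again cannot help.
            continue;
        }
        visited[slot] = true;
        match prog[pc] {
            Inst::Match => return true,
            Inst::Byte(b, next) => {
                if at < input.len() && input[at] == b {
                    stack.push((next, at + 1));
                }
            }
            Inst::Split(first, second) => {
                // Push the second branch first so the first is tried first.
                stack.push((second, at));
                stack.push((first, at));
            }
        }
    }
    false
}
```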

Lastly, the real workhorse of this library is the "lazy" DFA in src/dfa.rs.
It is distinct from the Pike VM in that the DFA is explicitly represented in
memory and is only ever in one state at a time. It is said to be "lazy" because
the DFA is computed as text is searched, where each byte in the search text
results in at most one new DFA state. It is made fast by caching states. DFAs
are susceptible to exponential state blow up (where the worst case is computing
a new state for every input byte, regardless of what's in the state cache). To
avoid using a lot of memory, the lazy DFA uses a bounded cache. Once the cache
is full, it is wiped and state computation starts over again. If the cache is
wiped too frequently, then the DFA gives up and searching falls back to one of
the aforementioned algorithms.
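The following sketch shows the shape of a lazy DFA with a bounded cache. It
is a toy built on assumptions, not the design of src/dfa.rs: `LazyDfa`, its
fields and the `max_states`/`max_wipes` limits are hypothetical, the `Inst`
enum is a cut-down instruction set, matching is anchored over the whole
input, and "wiped too frequently" is approximated by a simple wipe counter.

```rust
use std::collections::HashMap;

/// Hypothetical, cut-down instruction set (no captures).
#[derive(Clone, Copy)]
enum Inst {
    Byte(u8, usize),
    Split(usize, usize),
    Match,
}

struct LazyDfa<'p> {
    prog: &'p [Inst],
    /// DFA state id -> sorted set of NFA instructions it stands for.
    states: Vec<Vec<usize>>,
    ids: HashMap<Vec<usize>, usize>,
    /// Cached transitions, filled in as the input is scanned.
    trans: HashMap<(usize, u8), usize>,
    max_states: usize,
    max_wipes: usize,
    wipes: usize,
}

impl<'p> LazyDfa<'p> {
    fn new(prog: &'p [Inst], max_states: usize, max_wipes: usize) -> Self {
        LazyDfa {
            prog,
            states: Vec::new(),
            ids: HashMap::new(),
            trans: HashMap::new(),
            max_states,
            max_wipes,
            wipes: 0,
        }
    }

    /// Follow epsilon transitions from `pc`, collecting Byte/Match instructions.
    fn closure(&self, pc: usize, set: &mut Vec<usize>, seen: &mut [bool]) {
        if seen[pc] {
            return;
        }
        seen[pc] = true;
        match self.prog[pc] {
            Inst::Split(x, y) => {
                self.closure(x, set, seen);
                self.closure(y, set, seen);
            }
            _ => set.push(pc),
        }
    }

    /// Look up or create the DFA state for an NFA set, wiping a full cache.
    fn intern(&mut self, mut set: Vec<usize>) -> usize {
        set.sort_unstable();
        if let Some(&id) = self.ids.get(&set) {
            return id;
        }
        if self.states.len() >= self.max_states {
            self.states.clear();
            self.ids.clear();
            self.trans.clear();
            self.wipes += 1;
        }
        let id = self.states.len();
        self.states.push(set.clone());
        self.ids.insert(set, id);
        id
    }

    /// One transition; computes at most one new DFA state.
    fn step(&mut self, cur: usize, byte: u8) -> usize {
        if let Some(&next) = self.trans.get(&(cur, byte)) {
            return next;
        }
        let mut set = Vec::new();
        let mut seen = vec![false; self.prog.len()];
        for &pc in &self.states[cur] {
            if let Inst::Byte(b, next) = self.prog[pc] {
                if b == byte {
                    self.closure(next, &mut set, &mut seen);
                }
            }
        }
        let wipes_before = self.wipes;
        let next = self.intern(set);
        // After a wipe, `cur` no longer names a cached state, so skip caching.
        if self.wipes == wipes_before {
            self.trans.insert((cur, byte), next);
        }
        next
    }

    /// `Some(matched)` for the whole input, or `None` if the DFA gave up and
    /// the caller should fall back to another engine.
    fn matches(&mut self, input: &[u8]) -> Option<bool> {
        let mut set = Vec::new();
        let mut seen = vec![false; self.prog.len()];
        self.closure(0, &mut set, &mut seen);
        let mut cur = self.intern(set);
        for &b in input {
            cur = self.step(cur, b);
            if self.wipes > self.max_wipes {
                return None;
            }
        }
        let prog = self.prog;
        Some(self.states[cur].iter().any(|&pc| matches!(prog[pc], Inst::Match)))
    }
}
```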

All of the above matching engines expose precisely the same matching semantics.
This is indeed tested. (See the section below about testing.)

The following sub-sections describe the rest of the library and how each of the
matching engines is actually used.

### Parsing

Regular expressions are parsed using the regex-syntax crate, which is
maintained in this repository. The regex-syntax crate defines an abstract
syntax and provides very detailed error messages when a parse error is
encountered. Parsing is done in a separate crate so that others may benefit
from its existence, and because it is relatively divorced from the rest of the
regex library.

The regex-syntax crate also provides sophisticated support for extracting
prefix and suffix literals from regular expressions.

### Compilation

The compiler is in src/compile.rs. The input to the compiler is some abstract
syntax for a regular expression and the output is a sequence of opcodes that
matching engines use to execute a search. (One can think of matching engines as
mini virtual machines.) The sequence of opcodes is a particular encoding of a
non-deterministic finite automaton. In particular, the opcodes explicitly rely
on epsilon transitions.

Consider a simple regular expression like `a|b`. Its compiled form looks like
this:

    000 Save(0)
    001 Split(2, 3)
    002 'a' (goto: 4)
    003 'b'
    004 Save(1)
    005 Match

The first column is the instruction pointer and the second column is the
instruction. Save instructions indicate that the current position in the input
should be stored in a capture location. Split instructions represent a binary
branch in the program (i.e., epsilon transitions). The instructions `'a'` and
`'b'` indicate that the literal byte `'a'` or `'b'` should match.
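As a rough illustration of how a matching engine could interpret such a
program, here is a tiny Pike-VM-style simulation. The `Inst` enum, `Slots`,
`add_thread` and `pike_vm` are hypothetical names for this sketch and do not
mirror the definitions in src/prog.rs or src/pikevm.rs. Every instruction
carries its own `goto`, the search is anchored at offset 0, and only the
overall match span (slots 0 and 1) is tracked.

```rust
/// Hypothetical encoding of the instructions shown above.
#[derive(Clone, Copy)]
enum Inst {
    /// Store the current offset in a slot, then go to the next instruction.
    Save(usize, usize),
    /// Epsilon branch: prefer the first target over the second.
    Split(usize, usize),
    /// Match one byte, then go to the given instruction.
    Byte(u8, usize),
    Match,
}

type Slots = [Option<usize>; 2];

/// Follow epsilon transitions (Save, Split) from `pc` and queue every thread
/// that ends up waiting on a Byte or Match instruction.
fn add_thread(
    prog: &[Inst],
    list: &mut Vec<(usize, Slots)>,
    seen: &mut [bool],
    pc: usize,
    at: usize,
    mut slots: Slots,
) {
    if seen[pc] {
        return;
    }
    seen[pc] = true;
    match prog[pc] {
        Inst::Save(slot, next) => {
            slots[slot] = Some(at);
            add_thread(prog, list, seen, next, at, slots);
        }
        Inst::Split(x, y) => {
            add_thread(prog, list, seen, x, at, slots);
            add_thread(prog, list, seen, y, at, slots);
        }
        Inst::Byte(..) | Inst::Match => list.push((pc, slots)),
    }
}

/// Anchored search from offset 0; returns the (start, end) of the match.
fn pike_vm(prog: &[Inst], input: &[u8]) -> Option<(usize, usize)> {
    let mut clist = Vec::new();
    let mut matched = None;
    add_thread(prog, &mut clist, &mut vec![false; prog.len()], 0, 0, [None; 2]);
    for at in 0..=input.len() {
        let mut nlist = Vec::new();
        let mut seen = vec![false; prog.len()];
        for &(pc, slots) in &clist {
            match prog[pc] {
                Inst::Match => {
                    matched = slots[0].zip(slots[1]);
                    // Threads after this one have lower priority: drop them.
                    break;
                }
                Inst::Byte(b, next) if at < input.len() && input[at] == b => {
                    add_thread(prog, &mut nlist, &mut seen, next, at + 1, slots);
                }
                _ => {}
            }
        }
        if nlist.is_empty() {
            break;
        }
        clist = nlist;
    }
    matched
}

/// The program for `a|b` from the listing, with explicit gotos everywhere.
fn a_or_b() -> Vec<Inst> {
    vec![
        Inst::Save(0, 1),
        Inst::Split(2, 3),
        Inst::Byte(b'a', 4),
        Inst::Byte(b'b', 4),
        Inst::Save(1, 5),
        Inst::Match,
    ]
}
```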

In older versions of this library, the compilation looked like this:

    000 Save(0)
    001 Split(2, 3)
    002 'a'
    003 Jump(5)
    004 'b'
    005 Save(1)
    006 Match

The difference is that empty instructions that merely served to move execution
from one point in the program to another (such as `Jump`) were removed.
Instead, every instruction has a `goto` pointer embedded into it. This resulted
in a small performance boost for the Pike VM, because it was one fewer epsilon
transition that it had to follow.

There exist more instructions and they are defined and documented in
src/prog.rs.

Compilation has several knobs and a few unfortunately complicated invariants.
Namely, the output of compilation can be one of two types of programs: a
program that executes on Unicode scalar values or a program that executes
on raw bytes. In the former case, the matching engine is responsible for
performing UTF-8 decoding and executing instructions using Unicode codepoints.
In the latter case, the program handles UTF-8 decoding implicitly, so that the
matching engine can execute on raw bytes. All matching engines can execute
either Unicode or byte based programs except for the lazy DFA, which requires
byte based programs. In general, both representations were kept because (1) the
lazy DFA requires byte based programs so that states can be encoded in a memory
efficient manner and (2) the Pike VM benefits greatly from inlining Unicode
character classes into fewer instructions as it results in fewer epsilon
transitions.

N.B. UTF-8 decoding is built into the compiled program by making use of the
utf8-ranges crate. The compiler in this library factors out common suffixes to
reduce the size of huge character classes (e.g., `\pL`).
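To see what "UTF-8 decoding built into the program" means, consider the class
of all Unicode scalar values. It can be written as nine alternative sequences
of byte ranges (the well-formed UTF-8 table from the Unicode standard), and a
byte based program only has to walk those ranges. The sketch below hard-codes
that table instead of calling the utf8-ranges crate; `ANY_SCALAR` and
`match_scalar` are hypothetical names. Note how many alternatives end in the
same `[80-BF]` ranges, which is the kind of common suffix a compiler can
share.

```rust
type ByteRange = (u8, u8);

/// Byte range sequences that together match any one UTF-8 encoded scalar.
const ANY_SCALAR: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
];

/// If `bytes` starts with one encoded scalar value, return its length.
/// No decoding happens: each byte is only compared against a range.
fn match_scalar(bytes: &[u8]) -> Option<usize> {
    ANY_SCALAR
        .iter()
        .find(|seq| {
            seq.len() <= bytes.len()
                && seq
                    .iter()
                    .zip(bytes)
                    .all(|(&(lo, hi), &b)| lo <= b && b <= hi)
        })
        .map(|seq| seq.len())
}
```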

A regrettable consequence of this split in instruction sets is that we
generally need to compile two programs: one for NFA execution and one for the
lazy DFA.

In fact, it is worse than that: the lazy DFA is not capable of finding the
starting location of a match in a single scan, and must instead execute a
backwards search after finding the end location. To execute a backwards search,
we must have compiled the regular expression *in reverse*.
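The two-pass idea can be shown with hand-written state machines for the
regex `a+b`. This is only an illustration: the real lazy DFA is generated
from compiled programs, and `forward_end`, `reverse_start` and `find` are
hypothetical names. The forward scan reports where the leftmost match ends;
the reverse scan then runs the reversed pattern `ba+` backwards from that
point and takes the longest match to recover the start.

```rust
/// Forward, unanchored scan for `a+b`: offset just past the leftmost match.
fn forward_end(input: &[u8]) -> Option<usize> {
    // The only state we need: have we just read one or more 'a's?
    let mut seen_a = false;
    for (i, &b) in input.iter().enumerate() {
        match (seen_a, b) {
            (_, b'a') => seen_a = true,
            (true, b'b') => return Some(i + 1),
            _ => seen_a = false,
        }
    }
    None
}

/// Reverse scan for `ba+`, anchored at `end` and reading right to left.
/// The longest match wins, which yields the start of the forward match.
fn reverse_start(input: &[u8], end: usize) -> Option<usize> {
    let mut at = end.checked_sub(1)?;
    if input[at] != b'b' {
        return None;
    }
    let mut start = None;
    while at > 0 && input[at - 1] == b'a' {
        at -= 1;
        start = Some(at);
    }
    start
}

/// Full search: one forward pass for the end, one reverse pass for the start.
fn find(input: &[u8]) -> Option<(usize, usize)> {
    let end = forward_end(input)?;
    let start = reverse_start(input, end)?;
    Some((start, end))
}
```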

This means that every compilation of a regular expression generally results in
three distinct programs. It would be possible to lazily compile the Unicode
program, since it is never needed if (1) the regular expression uses no word
boundary assertions and (2) the caller never asks for sub-capture locations.

### Execution

At the time of writing, there are four matching engines in this library:

1. The Pike VM (supports captures).
2. Bounded backtracking (supports captures).
3. Literal substring or multi-substring search.
4. Lazy DFA (no support for Unicode word boundary assertions).

Only the first two matching engines are capable of executing every regular
expression program. They also happen to be the slowest, which means we need
some logic that (1) knows various facts about the regular expression and (2)
knows what the caller wants. Using this information, we can determine which
engine (or engines) to use.

The logic for choosing which engine to execute is in src/exec.rs and is
documented on the Exec type. Exec values contain regular expression Programs
(defined in src/prog.rs), which contain all the necessary tidbits for actually
executing a regular expression on search text.

For the most part, the execution logic is straightforward and follows the
limitations of each engine described above pretty faithfully. The hairiest
part of src/exec.rs by far is the execution of the lazy DFA, since it requires
a forwards and backwards search, and then falls back to either the Pike VM or
backtracking if the caller requested capture locations.
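The selection rules described above can be caricatured as a single function.
Everything in this sketch is an assumption made for illustration: `Engine`,
`Facts`, `Request`, `choose` and the `BACKTRACK_BUDGET` value are hypothetical
and are not taken from src/exec.rs. It also picks exactly one engine, whereas
the real logic can chain them (for example, the lazy DFA to find the match
bounds and then an NFA engine to fill in capture locations).

```rust
#[derive(Debug, PartialEq)]
enum Engine {
    Literal,
    LazyDfa,
    Backtrack,
    PikeVm,
}

/// Hypothetical facts known about a compiled regular expression.
struct Facts {
    /// The whole regex is a literal or an alternation of literals.
    is_pure_literal: bool,
    has_unicode_word_boundary: bool,
    /// Number of instructions in the compiled program.
    program_len: usize,
}

/// Hypothetical description of what the caller wants.
struct Request {
    wants_captures: bool,
    haystack_len: usize,
}

/// Made-up limit on the size of the backtracker's visited table.
const BACKTRACK_BUDGET: usize = 256 * 1024;

fn choose(facts: &Facts, req: &Request) -> Engine {
    // Fastest path: plain substring search, usable only without captures.
    if facts.is_pure_literal && !req.wants_captures {
        return Engine::Literal;
    }
    // The lazy DFA cannot report captures or Unicode word boundaries.
    if !req.wants_captures && !facts.has_unicode_word_boundary {
        return Engine::LazyDfa;
    }
    // Backtracking needs one visited entry per (instruction, offset) pair,
    // so it is only chosen when both the regex and the haystack are small.
    let visited = facts.program_len.saturating_mul(req.haystack_len + 1);
    if visited <= BACKTRACK_BUDGET {
        Engine::Backtrack
    } else {
        Engine::PikeVm
    }
}
```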

The Exec type also contains mutable scratch space for each type of matching
engine. This scratch space is used during search (for example, for the lazy
DFA, it contains compiled states that are reused on subsequent searches).

### Programs

A regular expression program is essentially a sequence of opcodes produced by
the compiler plus various facts about the regular expression (such as whether
it is anchored, its capture names, etc.).

### The regex! macro

The `regex!` macro no longer exists. It was developed in a bygone era as a
compiler plugin during the infancy of the regex crate. Back then, the only
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
was built at runtime were the following:

  1. Syntax checking was done at compile time. Your Rust program wouldn't
     compile if your regex didn't compile.
  2. Reduction of overhead that was proportional to the size of the regex.
     For the most part, this overhead consisted of heap allocation, which
     was nearly eliminated in the compiler plugin.

The main takeaway here is that the compiler plugin was a marginally faster
version of a slow regex engine. As the regex crate evolved, it grew other regex
engines (DFA, bounded backtracker) and sophisticated literal optimizations.
The regex macro didn't keep pace, and it therefore became (dramatically) slower
than the dynamic engines. The only reason left to use it was for the compile
time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint
tool) has a lint that checks your regular expression validity, which mostly
replaces that use case.

Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
complained. At that point, it seemed prudent to just remove it.

Will a compiler plugin be brought back? The future is murky, but there is
definitely an opportunity there to build something that is faster than the
dynamic engines in some cases. But it will be challenging! As of now, there
are no plans to work on this.


## Testing

A key aspect of any mature regex library is its test suite. A subset of the
tests in this library come from Glenn Fowler's AT&T test suite (its online
presence seems gone at the time of writing). The source of the test suite is
located in src/testdata. The script scripts/regex-match-tests.py takes the
test suite in src/testdata and generates tests/matches.rs.

There are also many other manually crafted tests and regression tests in
tests/tests.rs. Some of these tests were taken from RE2.

The biggest source of complexity in the tests is related to answering this
question: how can we reuse the tests to check all of our matching engines? One
approach would have been to encode every test into some kind of format (like
the AT&T test suite) and code generate tests for each matching engine. The
approach we use in this library is to create a Cargo.toml entry point for each
matching engine we want to test. The entry points are:

* `tests/test_default.rs` - tests `Regex::new`
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex.
* `tests/test_nfa_bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *arbitrary* byte based programs.
* `tests/test_nfa_utf8bytes.rs` - tests `Regex::new`, forced to use the NFA
  algorithm on every regex and use *UTF-8* byte based programs.
* `tests/test_backtrack.rs` - tests `Regex::new`, forced to use
  backtracking on every regex.
* `tests/test_backtrack_bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *arbitrary* byte based programs.
* `tests/test_backtrack_utf8bytes.rs` - tests `Regex::new`, forced to use
  backtracking on every regex and use *UTF-8* byte based programs.
* `tests/test_crates_regex.rs` - tests to make sure that all of the
  backends behave in the same way against a number of quickcheck
  generated random inputs. These tests need to be enabled through
  the `RUST_REGEX_RANDOM_TEST` environment variable (see
  below).

The lazy DFA and pure literal engines are absent from this list because
they cannot be used on every regular expression. Instead, we rely on
`tests/test_dynamic.rs` to test the lazy DFA and literal engines when possible.

Since the tests are repeated several times, and because `cargo test` runs all
entry points, it can take a while to compile everything. To reduce compile
times slightly, try using `cargo test --test default`, which will only use the
`tests/test_default.rs` entry point.

The random testing takes quite a while, so it is not enabled by default.
In order to run the random testing you can set the
`RUST_REGEX_RANDOM_TEST` environment variable to anything before
invoking `cargo test` (for example, `RUST_REGEX_RANDOM_TEST=1 cargo test`).
Note that this variable is inspected at compile time, so if the tests don't
seem to be running, you may need to run `cargo clean`.
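For readers wondering how an environment variable can be "inspected at
compile time": Rust's built-in `option_env!` macro does exactly that. The
sketch below shows the general mechanism only. Whether the test entry point
gates the random tests this way is an assumption, and `random_tests_enabled`
and `random_tests_status` are hypothetical names.

```rust
/// True if `RUST_REGEX_RANDOM_TEST` was set when this code was compiled.
/// `option_env!` is expanded by the compiler, so setting or unsetting the
/// variable afterwards changes nothing until the code is rebuilt.
fn random_tests_enabled() -> bool {
    option_env!("RUST_REGEX_RANDOM_TEST").is_some()
}

/// Hypothetical gate that a slow test could consult before doing any work.
fn random_tests_status() -> &'static str {
    if random_tests_enabled() {
        "running random tests"
    } else {
        "skipping random tests"
    }
}
```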

## Benchmarking

The benchmarking in this crate is made up of many micro-benchmarks. Currently,
there are two primary sets of benchmarks: the benchmarks that were adopted
at this library's inception (in `bench/src/misc.rs`) and a newer set of
benchmarks meant to test various optimizations. Specifically, the latter set
contain some analysis and are in `bench/src/sherlock.rs`. Also, the latter
set are all executed on the same lengthy input whereas the former benchmarks
are executed on strings of varying length.

There is also a smattering of benchmarks for parsing and compilation.

Benchmarks are in a separate crate so that their dependencies can be managed
separately from the main regex crate.

Benchmarking follows a setup that is about as wonky as the one for tests.
There are multiple entry points:

* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` - benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
* `bench_onig.rs` - benchmarks Oniguruma

The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
regular expression library. In general, this regex library compares favorably
(there are even a few benchmarks that PCRE simply runs too slowly on or
outright can't execute at all). I would love to add other regular expression
library benchmarks (especially RE2).

If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

    $ (cd bench && ./run rust)

If you want to compare your results with older benchmarks, then try:

    $ (cd bench && ./run rust | tee old)
    $ ... make it faster
    $ (cd bench && ./run rust | tee new)
    $ cargo benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
`./bench/bench --help`.

## Dev Docs

When digging your teeth into the codebase for the first time, the
crate documentation can be a great resource. By default `rustdoc`
will strip out all documentation of private crate members in an
effort to help consumers of the crate focus on the *interface*
without having to concern themselves with the *implementation*.
Normally this is a great thing, but if you want to start hacking
on regex internals it is not what you want. Many of the private members
of this crate are well documented with rustdoc style comments, and
it would be a shame to miss out on the opportunity that presents.
You can generate the private docs with:

```
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
```

On recent toolchains, `cargo doc --document-private-items` is a simpler way
to get documentation that includes private items.

Then just point your browser at `target/doc/regex/index.html`.

See https://github.com/rust-lang/rust/issues/15347 for more info
about generating developer docs for internal use.