1c67d6573Sopenharmony_ciYour friendly guide to understanding the performance characteristics of this
2c67d6573Sopenharmony_cicrate.
3c67d6573Sopenharmony_ci
4c67d6573Sopenharmony_ciThis guide assumes some familiarity with the public API of this crate, which
5c67d6573Sopenharmony_cican be found here: https://docs.rs/regex
6c67d6573Sopenharmony_ci
7c67d6573Sopenharmony_ci## Theory vs. Practice
8c67d6573Sopenharmony_ci
9c67d6573Sopenharmony_ciOne of the design goals of this crate is to provide worst case linear time
10c67d6573Sopenharmony_cibehavior with respect to the text searched using finite state automata. This
11c67d6573Sopenharmony_cimeans that, *in theory*, the performance of this crate is much better than most
12c67d6573Sopenharmony_ciregex implementations, which typically use backtracking which has worst case
13c67d6573Sopenharmony_ciexponential time.
14c67d6573Sopenharmony_ci
15c67d6573Sopenharmony_ciFor example, try opening a Python interpreter and typing this:
16c67d6573Sopenharmony_ci
17c67d6573Sopenharmony_ci    >>> import re
18c67d6573Sopenharmony_ci    >>> re.search('(a*)*c', 'a' * 30).span()
19c67d6573Sopenharmony_ci
20c67d6573Sopenharmony_ciI'll wait.
21c67d6573Sopenharmony_ci
22c67d6573Sopenharmony_ciAt some point, you'll figure out that it won't terminate any time soon. ^C it.
23c67d6573Sopenharmony_ci
24c67d6573Sopenharmony_ciThe promise of this crate is that *this pathological behavior can't happen*.
25c67d6573Sopenharmony_ci
26c67d6573Sopenharmony_ciWith that said, just because we have protected ourselves against worst case
27c67d6573Sopenharmony_ciexponential behavior doesn't mean we are immune from large constant factors
28c67d6573Sopenharmony_cior places where the current regex engine isn't quite optimal. This guide will
29c67d6573Sopenharmony_cidetail those cases and provide guidance on how to avoid them, among other
30c67d6573Sopenharmony_cibits of general advice.
31c67d6573Sopenharmony_ci
32c67d6573Sopenharmony_ci## Thou Shalt Not Compile Regular Expressions In A Loop
33c67d6573Sopenharmony_ci
34c67d6573Sopenharmony_ci**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.
35c67d6573Sopenharmony_ci
36c67d6573Sopenharmony_ciDon't do it unless you really don't mind paying for it. Compiling a regular
37c67d6573Sopenharmony_ciexpression in this crate is quite expensive. It is conceivable that it may get
38c67d6573Sopenharmony_cifaster some day, but I wouldn't hold out hope for, say, an order of magnitude
39c67d6573Sopenharmony_ciimprovement. In particular, compilation can take any where from a few dozen
40c67d6573Sopenharmony_cimicroseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
41c67d6573Sopenharmony_ciclasses, in particular, have the largest impact on compilation performance. At
42c67d6573Sopenharmony_cithe time of writing, for example, `\pL{100}` takes around 44ms to compile. This
43c67d6573Sopenharmony_ciis because `\pL` corresponds to every letter in Unicode and compilation must
44c67d6573Sopenharmony_citurn it into a proper automaton that decodes a subset of UTF-8 which
45c67d6573Sopenharmony_cicorresponds to those letters. Compilation also spends some cycles shrinking the
46c67d6573Sopenharmony_cisize of the automaton.
47c67d6573Sopenharmony_ci
48c67d6573Sopenharmony_ciThis means that in order to realize efficient regex matching, one must
49c67d6573Sopenharmony_ci*amortize the cost of compilation*. Trivially, if a call to `is_match` is
50c67d6573Sopenharmony_ciinside a loop, then make sure your call to `Regex::new` is *outside* that loop.
51c67d6573Sopenharmony_ci
52c67d6573Sopenharmony_ciIn many programming languages, regular expressions can be conveniently defined
53c67d6573Sopenharmony_ciand compiled in a global scope, and code can reach out and use them as if
54c67d6573Sopenharmony_cithey were global static variables. In Rust, there is really no concept of
55c67d6573Sopenharmony_cilife-before-main, and therefore, one cannot utter this:
56c67d6573Sopenharmony_ci
57c67d6573Sopenharmony_ci    static MY_REGEX: Regex = Regex::new("...").unwrap();
58c67d6573Sopenharmony_ci
59c67d6573Sopenharmony_ciUnfortunately, this would seem to imply that one must pass `Regex` objects
60c67d6573Sopenharmony_ciaround to everywhere they are used, which can be especially painful depending
61c67d6573Sopenharmony_cion how your program is structured. Thankfully, the
62c67d6573Sopenharmony_ci[`lazy_static`](https://crates.io/crates/lazy_static)
63c67d6573Sopenharmony_cicrate provides an answer that works well:
64c67d6573Sopenharmony_ci
65c67d6573Sopenharmony_ci    use lazy_static::lazy_static;
66c67d6573Sopenharmony_ci    use regex::Regex;
67c67d6573Sopenharmony_ci
68c67d6573Sopenharmony_ci    fn some_helper_function(text: &str) -> bool {
69c67d6573Sopenharmony_ci        lazy_static! {
70c67d6573Sopenharmony_ci            static ref MY_REGEX: Regex = Regex::new("...").unwrap();
71c67d6573Sopenharmony_ci        }
72c67d6573Sopenharmony_ci        MY_REGEX.is_match(text)
73c67d6573Sopenharmony_ci    }
74c67d6573Sopenharmony_ci
75c67d6573Sopenharmony_ciIn other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
76c67d6573Sopenharmony_ciit were a global static value. What is actually happening under the covers is
77c67d6573Sopenharmony_cithat the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
78c67d6573Sopenharmony_ciof `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
79c67d6573Sopenharmony_ciit's self contained and everything works exactly as you expect. In particular,
80c67d6573Sopenharmony_ci`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
81c67d6573Sopenharmony_cia `Mutex`. On that note...
82c67d6573Sopenharmony_ci
83c67d6573Sopenharmony_ci## Using a regex from multiple threads
84c67d6573Sopenharmony_ci
85c67d6573Sopenharmony_ci**Advice**: The performance impact from using a `Regex` from multiple threads
86c67d6573Sopenharmony_ciis likely negligible. If necessary, clone the `Regex` so that each thread gets
87c67d6573Sopenharmony_ciits own copy. Cloning a regex does not incur any additional memory overhead
88c67d6573Sopenharmony_cithan what would be used by using a `Regex` from multiple threads
89c67d6573Sopenharmony_cisimultaneously. *Its only cost is ergonomics.*
90c67d6573Sopenharmony_ci
91c67d6573Sopenharmony_ciIt is supported and encouraged to define your regexes using `lazy_static!` as
92c67d6573Sopenharmony_ciif they were global static values, and then use them to search text from
93c67d6573Sopenharmony_cimultiple threads simultaneously.
94c67d6573Sopenharmony_ci
95c67d6573Sopenharmony_ciOne might imagine that this is possible because a `Regex` represents a
96c67d6573Sopenharmony_ci*compiled* program, so that any allocation or mutation is already done, and is
97c67d6573Sopenharmony_citherefore read-only. Unfortunately, this is not true. Each type of search
98c67d6573Sopenharmony_cistrategy in this crate requires some kind of mutable scratch space to use
99c67d6573Sopenharmony_ci*during search*. For example, when executing a DFA, its states are computed
100c67d6573Sopenharmony_cilazily and reused on subsequent searches. Those states go into that mutable
101c67d6573Sopenharmony_ciscratch space.
102c67d6573Sopenharmony_ci
103c67d6573Sopenharmony_ciThe mutable scratch space is an implementation detail, and in general, its
104c67d6573Sopenharmony_cimutation should not be observable from users of this crate. Therefore, it uses
105c67d6573Sopenharmony_ciinterior mutability. This implies that `Regex` can either only be used from one
106c67d6573Sopenharmony_cithread, or it must do some sort of synchronization. Either choice is
107c67d6573Sopenharmony_cireasonable, but this crate chooses the latter, in particular because it is
108c67d6573Sopenharmony_ciergonomic and makes use with `lazy_static!` straight forward.
109c67d6573Sopenharmony_ci
110c67d6573Sopenharmony_ciSynchronization implies *some* amount of overhead. When a `Regex` is used from
111c67d6573Sopenharmony_cia single thread, this overhead is negligible. When a `Regex` is used from
112c67d6573Sopenharmony_cimultiple threads simultaneously, it is possible for the overhead of
113c67d6573Sopenharmony_cisynchronization from contention to impact performance. The specific cases where
114c67d6573Sopenharmony_cicontention may happen is if you are calling any of these methods repeatedly
115c67d6573Sopenharmony_cifrom multiple threads simultaneously:
116c67d6573Sopenharmony_ci
117c67d6573Sopenharmony_ci* shortest_match
118c67d6573Sopenharmony_ci* is_match
119c67d6573Sopenharmony_ci* find
120c67d6573Sopenharmony_ci* captures
121c67d6573Sopenharmony_ci
122c67d6573Sopenharmony_ciIn particular, every invocation of one of these methods must synchronize with
123c67d6573Sopenharmony_ciother threads to retrieve its mutable scratch space before searching can start.
124c67d6573Sopenharmony_ciIf, however, you are using one of these methods:
125c67d6573Sopenharmony_ci
126c67d6573Sopenharmony_ci* find_iter
127c67d6573Sopenharmony_ci* captures_iter
128c67d6573Sopenharmony_ci
129c67d6573Sopenharmony_ciThen you may not suffer from contention since the cost of synchronization is
130c67d6573Sopenharmony_ciamortized on *construction of the iterator*. That is, the mutable scratch space
131c67d6573Sopenharmony_ciis obtained when the iterator is created and retained throughout its lifetime.
132c67d6573Sopenharmony_ci
133c67d6573Sopenharmony_ci## Only ask for what you need
134c67d6573Sopenharmony_ci
135c67d6573Sopenharmony_ci**Advice**: Prefer in this order: `is_match`, `find`, `captures`.
136c67d6573Sopenharmony_ci
137c67d6573Sopenharmony_ciThere are three primary search methods on a `Regex`:
138c67d6573Sopenharmony_ci
139c67d6573Sopenharmony_ci* is_match
140c67d6573Sopenharmony_ci* find
141c67d6573Sopenharmony_ci* captures
142c67d6573Sopenharmony_ci
143c67d6573Sopenharmony_ciIn general, these are ordered from fastest to slowest.
144c67d6573Sopenharmony_ci
145c67d6573Sopenharmony_ci`is_match` is fastest because it doesn't actually need to find the start or the
146c67d6573Sopenharmony_ciend of the leftmost-first match. It can quit immediately after it knows there
147c67d6573Sopenharmony_ciis a match. For example, given the regex `a+` and the haystack, `aaaaa`, the
148c67d6573Sopenharmony_cisearch will quit after examining the first byte.
149c67d6573Sopenharmony_ci
150c67d6573Sopenharmony_ciIn contrast, `find` must return both the start and end location of the
151c67d6573Sopenharmony_cileftmost-first match. It can use the DFA matcher for this, but must run it
152c67d6573Sopenharmony_ciforwards once to find the end of the match *and then run it backwards* to find
153c67d6573Sopenharmony_cithe start of the match. The two scans and the cost of finding the real end of
154c67d6573Sopenharmony_cithe leftmost-first match make this more expensive than `is_match`.
155c67d6573Sopenharmony_ci
156c67d6573Sopenharmony_ci`captures` is the most expensive of them all because it must do what `find`
157c67d6573Sopenharmony_cidoes, and then run either the bounded backtracker or the Pike VM to fill in the
158c67d6573Sopenharmony_cicapture group locations. Both of these are simulations of an NFA, which must
159c67d6573Sopenharmony_cispend a lot of time shuffling states around. The DFA limits the performance hit
160c67d6573Sopenharmony_cisomewhat by restricting the amount of text that must be searched via an NFA
161c67d6573Sopenharmony_cisimulation.
162c67d6573Sopenharmony_ci
163c67d6573Sopenharmony_ciOne other method not mentioned is `shortest_match`. This method has precisely
164c67d6573Sopenharmony_cithe same performance characteristics as `is_match`, except it will return the
165c67d6573Sopenharmony_ciend location of when it discovered a match. For example, given the regex `a+`
166c67d6573Sopenharmony_ciand the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
167c67d6573Sopenharmony_cithe latter of which being the correct end location of the leftmost-first match.
168c67d6573Sopenharmony_ci
169c67d6573Sopenharmony_ci## Literals in your regex may make it faster
170c67d6573Sopenharmony_ci
171c67d6573Sopenharmony_ci**Advice**: Literals can reduce the work that the regex engine needs to do. Use
172c67d6573Sopenharmony_cithem if you can, especially as prefixes.
173c67d6573Sopenharmony_ci
174c67d6573Sopenharmony_ciIn particular, if your regex starts with a prefix literal, the prefix is
175c67d6573Sopenharmony_ciquickly searched before entering the (much slower) regex engine. For example,
176c67d6573Sopenharmony_cigiven the regex `foo\w+`, the literal `foo` will be searched for using
177c67d6573Sopenharmony_ciBoyer-Moore. If there's no match, then no regex engine is ever used. Only when
178c67d6573Sopenharmony_cithere's a match is the regex engine invoked at the location of the match, which
179c67d6573Sopenharmony_cieffectively permits the regex engine to skip large portions of a haystack.
180c67d6573Sopenharmony_ciIf a regex is comprised entirely of literals (possibly more than one), then
181c67d6573Sopenharmony_ciit's possible that the regex engine can be avoided entirely even when there's a
182c67d6573Sopenharmony_cimatch.
183c67d6573Sopenharmony_ci
184c67d6573Sopenharmony_ciWhen one literal is found, Boyer-Moore is used. When multiple literals are
185c67d6573Sopenharmony_cifound, then an optimized version of Aho-Corasick is used.
186c67d6573Sopenharmony_ci
187c67d6573Sopenharmony_ciThis optimization is in particular extended quite a bit in this crate. Here are
188c67d6573Sopenharmony_cia few examples of regexes that get literal prefixes detected:
189c67d6573Sopenharmony_ci
190c67d6573Sopenharmony_ci* `(foo|bar)` detects `foo` and `bar`
191c67d6573Sopenharmony_ci* `(a|b)c` detects `ac` and `bc`
192c67d6573Sopenharmony_ci* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
193c67d6573Sopenharmony_ci* `a?b` detects `a` and `b`
194c67d6573Sopenharmony_ci* `a*b` detects `a` and `b`
195c67d6573Sopenharmony_ci* `(ab){3,6}` detects `ababab`
196c67d6573Sopenharmony_ci
197c67d6573Sopenharmony_ciLiterals in anchored regexes can also be used for detecting non-matches very
198c67d6573Sopenharmony_ciquickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
199c67d6573Sopenharmony_cijust by examining the first (or last) three bytes of the haystack.
200c67d6573Sopenharmony_ci
201c67d6573Sopenharmony_ci## Unicode word boundaries may prevent the DFA from being used
202c67d6573Sopenharmony_ci
203c67d6573Sopenharmony_ci**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
204c67d6573Sopenharmony_ciinstead of `\b` if you care about consistent performance more than correctness.
205c67d6573Sopenharmony_ci
206c67d6573Sopenharmony_ciIt's a sad state of the current implementation. At the moment, the DFA will try
207c67d6573Sopenharmony_cito interpret Unicode word boundaries as if they were ASCII word boundaries.
208c67d6573Sopenharmony_ciIf the DFA comes across any non-ASCII byte, it will quit and fall back to an
209c67d6573Sopenharmony_cialternative matching engine that can handle Unicode word boundaries correctly.
210c67d6573Sopenharmony_ciThe alternate matching engine is generally quite a bit slower (perhaps by an
211c67d6573Sopenharmony_ciorder of magnitude). If necessary, this can be ameliorated in two ways.
212c67d6573Sopenharmony_ci
213c67d6573Sopenharmony_ciThe first way is to add some number of literal prefixes to your regular
214c67d6573Sopenharmony_ciexpression. Even though the DFA may not be used, specialized routines will
215c67d6573Sopenharmony_cistill kick in to find prefix literals quickly, which limits how much work the
216c67d6573Sopenharmony_ciNFA simulation will need to do.
217c67d6573Sopenharmony_ci
218c67d6573Sopenharmony_ciThe second way is to give up on Unicode and use an ASCII word boundary instead.
219c67d6573Sopenharmony_ciOne can use an ASCII word boundary by disabling Unicode support. That is,
220c67d6573Sopenharmony_ciinstead of using `\b`, use `(?-u:\b)`.  Namely, given the regex `\b.+\b`, it
221c67d6573Sopenharmony_cican be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
222c67d6573Sopenharmony_ciis important to limit the scope of disabling the `u` flag, since it might lead
223c67d6573Sopenharmony_cito a syntax error if the regex could match arbitrary bytes. For example, if one
224c67d6573Sopenharmony_ciwrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
225c67d6573Sopenharmony_ciany *byte* when the Unicode flag is disabled.
226c67d6573Sopenharmony_ci
227c67d6573Sopenharmony_ciThe second way isn't appreciably different than just using a Unicode word
228c67d6573Sopenharmony_ciboundary in the first place, since the DFA will speculatively interpret it as
229c67d6573Sopenharmony_cian ASCII word boundary anyway. The key difference is that if an ASCII word
230c67d6573Sopenharmony_ciboundary is used explicitly, then the DFA won't quit in the presence of
231c67d6573Sopenharmony_cinon-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
232c67d6573Sopenharmony_cimore consistent performance.
233c67d6573Sopenharmony_ci
234c67d6573Sopenharmony_ciN.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
235c67d6573Sopenharmony_cican simply write `\b` to get an ASCII word boundary.
236c67d6573Sopenharmony_ci
237c67d6573Sopenharmony_ci## Excessive counting can lead to exponential state blow up in the DFA
238c67d6573Sopenharmony_ci
239c67d6573Sopenharmony_ci**Advice**: Don't write regexes that cause DFA state blow up if you care about
240c67d6573Sopenharmony_cimatch performance.
241c67d6573Sopenharmony_ci
242c67d6573Sopenharmony_ciWait, didn't I say that this crate guards against exponential worst cases?
243c67d6573Sopenharmony_ciWell, it turns out that the process of converting an NFA to a DFA can lead to
244c67d6573Sopenharmony_cian exponential blow up in the number of states. This crate specifically guards
245c67d6573Sopenharmony_ciagainst exponential blow up by doing two things:
246c67d6573Sopenharmony_ci
247c67d6573Sopenharmony_ci1. The DFA is computed lazily. That is, a state in the DFA only exists in
248c67d6573Sopenharmony_ci   memory if it is visited. In particular, the lazy DFA guarantees that *at
249c67d6573Sopenharmony_ci   most* one state is created for every byte of input. This, on its own,
250c67d6573Sopenharmony_ci   guarantees linear time complexity.
251c67d6573Sopenharmony_ci2. Of course, creating a new state for *every* byte of input means that search
252c67d6573Sopenharmony_ci   will go incredibly slow because of very large constant factors. On top of
253c67d6573Sopenharmony_ci   that, creating a state for every byte in a large haystack could result in
254c67d6573Sopenharmony_ci   exorbitant memory usage. To ameliorate this, the DFA bounds the number of
255c67d6573Sopenharmony_ci   states it can store. Once it reaches its limit, it flushes its cache. This
256c67d6573Sopenharmony_ci   prevents reuse of states that it already computed. If the cache is flushed
257c67d6573Sopenharmony_ci   too frequently, then the DFA will give up and execution will fall back to
258c67d6573Sopenharmony_ci   one of the NFA simulations.
259c67d6573Sopenharmony_ci
260c67d6573Sopenharmony_ciIn effect, this crate will detect exponential state blow up and fall back to
261c67d6573Sopenharmony_cia search routine with fixed memory requirements. This does, however, mean that
262c67d6573Sopenharmony_cisearching will be much slower than one might expect. Regexes that rely on
263c67d6573Sopenharmony_cicounting in particular are strong aggravators of this behavior. For example,
264c67d6573Sopenharmony_cimatching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.
265c67d6573Sopenharmony_ci
266c67d6573Sopenharmony_ciIn the future, it may be possible to increase the bound that the DFA uses,
267c67d6573Sopenharmony_ciwhich would allow the caller to choose how much memory they're willing to
268c67d6573Sopenharmony_cispend.
269c67d6573Sopenharmony_ci
270c67d6573Sopenharmony_ci## Resist the temptation to "optimize" regexes
271c67d6573Sopenharmony_ci
272c67d6573Sopenharmony_ci**Advice**: This ain't a backtracking engine.
273c67d6573Sopenharmony_ci
274c67d6573Sopenharmony_ciAn entire book was written on how to optimize Perl-style regular expressions.
275c67d6573Sopenharmony_ciMost of those techniques are not applicable for this library. For example,
276c67d6573Sopenharmony_cithere is no problem with using non-greedy matching or having lots of
277c67d6573Sopenharmony_cialternations in your regex.
278