Name |
Date |
Size |
#Lines |
LOC |
||
---|---|---|---|---|---|---|
.. | - | - | ||||
.github/workflows/ | 03-May-2024 | - | 159 | 138 | ||
data/ | 03-May-2024 | - | 4,476 | 3,837 | ||
src/ | 03-May-2024 | - | 9,593 | 5,070 | ||
tests/ | 03-May-2024 | - | 867 | 741 | ||
.cargo_vcs_info.json | D | 03-May-2024 | 74 | 6 | 5 | |
.gitignore | D | 03-May-2024 | 104 | 8 | 7 | |
Android.bp | D | 03-May-2024 | 2.1 KiB | 70 | 66 | |
COPYING | D | 03-May-2024 | 126 | 4 | 2 | |
Cargo.toml | D | 03-May-2024 | 2 KiB | 87 | 71 | |
Cargo.toml.orig | D | 03-May-2024 | 2 KiB | 76 | 64 | |
LICENSE | D | 03-May-2024 | 1.1 KiB | 22 | 17 | |
LICENSE-MIT | D | 03-May-2024 | 1.1 KiB | 22 | 17 | |
METADATA | D | 03-May-2024 | 427 | 20 | 19 | |
MODULE_LICENSE_MIT | D | 03-May-2024 | 0 | |||
OWNERS | D | 03-May-2024 | 40 | 2 | 1 | |
README.md | D | 03-May-2024 | 11.8 KiB | 224 | 186 | |
TEST_MAPPING | D | 03-May-2024 | 337 | 18 | 17 | |
TODO | D | 03-May-2024 | 583 | 11 | 10 | |
UNLICENSE | D | 03-May-2024 | 1.2 KiB | 25 | 20 | |
cargo2android.json | D | 03-May-2024 | 35 | 4 | 4 | |
rustfmt.toml | D | 03-May-2024 | 44 | 3 | 2 |
README.md
1regex-automata 2============== 3A low level regular expression library that uses deterministic finite automata. 4It supports a rich syntax with Unicode support, has extensive options for 5configuring the best space vs time trade off for your use case and provides 6support for cheap deserialization of automata for use in `no_std` environments. 7 8[](https://github.com/BurntSushi/regex-automata/actions) 9[](https://crates.io/crates/regex-automata) 10 11 12Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/). 13 14 15### Documentation 16 17https://docs.rs/regex-automata 18 19 20### Usage 21 22Add this to your `Cargo.toml`: 23 24```toml 25[dependencies] 26regex-automata = "0.1" 27``` 28 29and this to your crate root (if you're using Rust 2015): 30 31```rust 32extern crate regex_automata; 33``` 34 35 36### Example: basic regex searching 37 38This example shows how to compile a regex using the default configuration 39and then use it to find matches in a byte string: 40 41```rust 42use regex_automata::Regex; 43 44let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); 45let text = b"2018-12-24 2016-10-08"; 46let matches: Vec<(usize, usize)> = re.find_iter(text).collect(); 47assert_eq!(matches, vec![(0, 10), (11, 21)]); 48``` 49 50For more examples and information about the various knobs that can be turned, 51please see the [docs](https://docs.rs/regex-automata). 52 53 54### Support for `no_std` 55 56This crate comes with a `std` feature that is enabled by default. When the 57`std` feature is enabled, the API of this crate will include the facilities 58necessary for compiling, serializing, deserializing and searching with regular 59expressions. When the `std` feature is disabled, the API of this crate will 60shrink such that it only includes the facilities necessary for deserializing 61and searching with regular expressions. 62 63The intended workflow for `no_std` environments is thus as follows: 64 65* Write a program with the `std` feature that compiles and serializes a 66 regular expression. Serialization should only happen after first converting 67 the DFAs to use a fixed size state identifier instead of the default `usize`. 68 You may also need to serialize both little and big endian versions of each 69 DFA. (So that's 4 DFAs in total for each regex.) 70* In your `no_std` environment, follow the examples above for deserializing 71 your previously serialized DFAs into regexes. You can then search with them 72 as you would any regex. 73 74Deserialization can happen anywhere. For example, with bytes embedded into a 75binary or with a file memory mapped at runtime. 76 77Note that the 78[`ucd-generate`](https://github.com/BurntSushi/ucd-generate) 79tool will do the first step for you with its `dfa` or `regex` sub-commands. 80 81 82### Cargo features 83 84* `std` - **Enabled** by default. This enables the ability to compile finite 85 automata. This requires the `regex-syntax` dependency. Without this feature 86 enabled, finite automata can only be used for searching (using the approach 87 described above). 88* `transducer` - **Disabled** by default. This provides implementations of the 89 `Automaton` trait found in the `fst` crate. This permits using finite 90 automata generated by this crate to search finite state transducers. This 91 requires the `fst` dependency. 92 93 94### Differences with the regex crate 95 96The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a 97general purpose regular expression engine. It aims to automatically balance low 98compile times, fast search times and low memory usage, while also providing 99a convenient API for users. In contrast, this crate provides a lower level 100regular expression interface that is a bit less convenient while providing more 101explicit control over memory usage and search times. 102 103Here are some specific negative differences: 104 105* **Compilation can take an exponential amount of time and space** in the size 106 of the regex pattern. While most patterns do not exhibit worst case 107 exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will 108 build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should 109 not be compiled with this library. (In the future, the API may expose an 110 option to return an error if the DFA gets too big.) 111* This crate does not support sub-match extraction, which can be achieved with 112 the regex crate's "captures" API. This may be added in the future, but is 113 unlikely. 114* While the regex crate doesn't necessarily sport fast compilation times, the 115 regexes in this crate are almost universally slow to compile, especially when 116 they contain large Unicode character classes. For example, on my system, 117 compiling `\w{3}` with byte classes enabled takes just over 1 second and 118 almost 5MB of memory! (Compiling a sparse regex takes about the same time 119 but only uses about 500KB of memory.) Conversly, compiling the same regex 120 without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and 121 less than 5KB of memory. For this reason, you should only use Unicode 122 character classes if you absolutely need them! 123* This crate does not support regex sets. 124* This crate does not support zero-width assertions such as `^`, `$`, `\b` or 125 `\B`. 126* As a lower level crate, this library does not do literal optimizations. In 127 exchange, you get predictable performance regardless of input. The 128 philosophy here is that literal optimizations should be applied at a higher 129 level, although there is no easy support for this in the ecosystem yet. 130* There is no `&str` API like in the regex crate. In this crate, all APIs 131 operate on `&[u8]`. By default, match indices are guaranteed to fall on 132 UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled. 133 134With some of the downsides out of the way, here are some positive differences: 135 136* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply 137 deserialized. Deserialization always takes constant time since searching can 138 be performed directly on the raw serialized bytes of a DFA. 139* This crate was specifically designed so that the searching phase of a DFA has 140 minimal runtime requirements, and can therefore be used in `no_std` 141 environments. While `no_std` environments cannot compile regexes, they can 142 deserialize pre-compiled regexes. 143* Since this crate builds DFAs ahead of time, it will generally out-perform 144 the `regex` crate on equivalent tasks. The performance difference is likely 145 not large. However, because of a complex set of optimizations in the regex 146 crate (like literal optimizations), an accurate performance comparison may be 147 difficult to do. 148* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search 149 performance a small amount, but uses much less storage space. Potentially 150 even less than what the regex crate uses. 151* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`, 152 which enables one to do less work in some cases. For example, if you only 153 need the end of a match and not the start of a match, then you can use a DFA 154 directly without building a `Regex`, which always requires a second DFA to 155 find the start of a match. 156* Aside from choosing between dense and sparse DFAs, there are several options 157 for configuring the space usage vs search time trade off. These include 158 things like choosing a smaller state identifier representation, to 159 premultiplying state identifiers and splitting a DFA's alphabet into 160 equivalence classes. Finally, DFA minimization is also provided, but can 161 increase compilation times dramatically. 162 163 164### Future work 165 166* Look into being smarter about generating NFA states for large Unicode 167 character classes. These can create a lot of additional work for both the 168 determinizer and the minimizer, and I suspect this is the key thing we'll 169 want to improve if we want to make DFA compile times faster. I *believe* 170 it's possible to potentially build minimal or nearly minimal NFAs for the 171 special case of Unicode character classes by leveraging Daciuk's algorithms 172 for building minimal automata in linear time for sets of strings. See 173 https://blog.burntsushi.net/transducers/#construction for more details. The 174 key adaptation I think we need to make is to modify the algorithm to operate 175 on byte ranges instead of enumerating every codepoint in the set. Otherwise, 176 it might not be worth doing. 177* Add support for regex sets. It should be possible to do this by "simply" 178 introducing more match states. I think we can also report the positions at 179 each match, similar to how Aho-Corasick works. I think the long pole in the 180 tent here is probably the API design work and arranging it so that we don't 181 introduce extra overhead into the non-regex-set case without duplicating a 182 lot of code. It seems doable. 183* Stretch goal: support capturing groups by implementing "tagged" DFA 184 (transducers). Laurikari's paper is the usual reference here, but Trofimovich 185 has a much more thorough treatment here: 186 https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf 187 I've only read the paper once. I suspect it will require at least a few more 188 read throughs before I understand it. 189 See also: https://re2c.org 190* Possibly less ambitious goal: can we select a portion of Trofimovich's work 191 to make small fixed length look-around work? It would be really nice to 192 support ^, $ and \b, especially the Unicode variant of \b and CRLF aware $. 193* Experiment with code generating Rust code. There is an early experiment in 194 src/codegen.rs that is thoroughly bit-rotted. At the time, I was 195 experimenting with whether or not codegen would significant decrease the size 196 of a DFA, since if you squint hard enough, it's kind of like a sparse 197 representation. However, it didn't shrink as much as I thought it would, so 198 I gave up. The other problem is that Rust doesn't support gotos, so I don't 199 even know whether the "match on each state" in a loop thing will be fast 200 enough. Either way, it's probably a good option to have. For one thing, it 201 would be endian independent where as the serialization format of the DFAs in 202 this crate are endian dependent (so you need two versions of every DFA, but 203 you only need to compile one of them for any given arch). 204* Experiment with unrolling the match loops and fill out the benchmarks. 205* Add some kind of streaming API. I believe users of the library can already 206 implement something for this outside of the crate, but it would be good to 207 provide an official API. The key thing here is figuring out the API. I 208 suspect we might want to support several variants. 209* Make a decision on whether or not there is room for literal optimizations 210 in this crate. My original intent was to not let this crate sink down into 211 that very very very deep rabbit hole. But instead, we might want to provide 212 some way for literal optimizations to hook into the match routines. The right 213 path forward here is to probably build something outside of the crate and 214 then see about integrating it. After all, users can implement their own 215 match routines just as efficiently as what the crate provides. 216* A key downside of DFAs is that they can take up a lot of memory and can be 217 quite costly to build. Their worst case compilation time is O(2^n), where 218 n is the number of NFA states. A paper by Yang and Prasanna (2011) actually 219 seems to provide a way to character state blow up such that it is detectable. 220 If we could know whether a regex will exhibit state explosion or not, then 221 we could make an intelligent decision about whether to ahead-of-time compile 222 a DFA. 223 See: https://www.researchgate.net/profile/Xu-Shutu/publication/229032602_Characterization_of_a_global_germplasm_collection_and_its_potential_utilization_for_analysis_of_complex_quantitative_traits_in_maize/links/02bfe50f914d04c837000000/Characterization-of-a-global-germplasm-collection-and-its-potential-utilization-for-analysis-of-complex-quantitative-traits-in-maize.pdf 224