Name | Date | Size | #Lines | LOC
---|---|---|---|---
ctest/ | 12-May-2024 | - | 603 | 527
examples/ | 12-May-2024 | - | 13,165 | 10,457
include/ | 12-May-2024 | - | 586 | 79
src/ | 12-May-2024 | - | 755 | 666
Cargo.toml | 12-May-2024 | 557 | 23 | 20
LICENSE-APACHE | 12-May-2024 | 10.6 KiB | 202 | 169
LICENSE-MIT | 12-May-2024 | 1 KiB | 26 | 22
README.md | 12-May-2024 | 4.3 KiB | 104 | 81
test | 12-May-2024 | 182 | 8 | 4

README.md
C API for RUst's REgex engine
=============================
rure is a C API to Rust's regex library, which guarantees linear time
searching using finite automata. In exchange, it must give up some common
regex features such as backreferences and arbitrary lookaround. It does,
however, include capturing groups, lazy matching, Unicode support and word
boundary assertions. Its matching semantics generally correspond to Perl's,
or "leftmost first." Namely, the match locations reported correspond to the
first match that would be found by a backtracking engine.

The header file (`include/rure.h`) serves as the primary API documentation of
this library. Types and flags are documented first, and functions follow.

The regex syntax, along with other useful details, is documented in the Rust
API documentation: https://docs.rs/regex


Examples
--------
There are readable examples in the `ctest` and `examples` sub-directories.

Assuming you have
[Rust and Cargo installed](https://www.rust-lang.org/downloads.html)
(and a C compiler), then this should work to run the `iter` example:

```
$ git clone https://github.com/rust-lang/regex
$ cd regex/regex-capi/examples
$ ./compile
$ LD_LIBRARY_PATH=../target/release ./iter
```


Performance
-----------
It's fast. Its core matching engine is a lazy DFA, which is what GNU grep
and RE2 use. Like GNU grep, this regex engine can detect multi-byte literals
in the regex and will use fast literal string searching to quickly skip
through the input to find possible match locations.

All memory usage is bounded and all searching takes linear time with respect
to the input string.

For more details, see the PERFORMANCE guide:
https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md


Text encoding
-------------
All regular expressions must be valid UTF-8.

The text encoding of haystacks is more complicated.
To a first approximation, haystacks should be UTF-8. In fact, UTF-8 (and,
one supposes, ASCII) is the only well-defined text encoding supported by
this library. It is impossible to match UTF-16, UTF-32 or any other encoding
without first transcoding it to UTF-8.

With that said, haystacks do not need to be valid UTF-8, and if they aren't
valid UTF-8, no performance penalty is paid. Whether invalid UTF-8 is
matched or not depends on the regular expression. For example, with the
`RURE_FLAG_UNICODE` flag enabled, the regex `.` is guaranteed to match a
single UTF-8 encoding of a Unicode codepoint (sans LF). In particular,
it will not match invalid UTF-8 such as `\xFF`, nor will it match surrogate
codepoints or "alternate" (i.e., non-minimal) encodings of codepoints.
However, with the `RURE_FLAG_UNICODE` flag disabled, the regex `.` will match
any *single* arbitrary byte (sans LF), including `\xFF`.

This provides a useful invariant: wherever `RURE_FLAG_UNICODE` is set, the
corresponding regex is guaranteed to match valid UTF-8. Invalid UTF-8 will
always prevent a match from happening when the flag is set. Since flags can be
toggled in the regular expression itself, this allows one to pick and choose
which parts of the regular expression must match UTF-8 or not.

A good rule of thumb is to always enable the `RURE_FLAG_UNICODE` flag (which
is enabled when using `rure_compile_must`) and selectively disable the flag
when you want to match arbitrary bytes. The flag can be disabled in a regular
expression with `(?-u)`.

Finally, if you want to match specific invalid UTF-8 bytes, you can use
regex escape sequences. For example, `(?-u)\\xFF` will match `\xFF`. It's not
possible to use C literal escape sequences in this case since regular
expressions must be valid UTF-8.


Aborts
------
This library will abort your process if an unwinding panic is caught in the
Rust code.
Generally, a panic occurs when there is a bug in the program or
if allocation fails. It is possible to cause this behavior by passing
invalid inputs to some functions. For example, giving an invalid capture
group index to `rure_captures_at` will cause Rust's bounds checks to fail,
which will cause a panic, which will be caught and printed to stderr. The
process will then `abort`.


Missing
-------
There are a few things missing from the C API that are present in the Rust API.
There's no particular (known) reason why they are missing; they just haven't
been implemented yet.

* Splitting a string by a regex.
* Replacing regex matches in a string with some other text.