Lines Matching +full:storage +full:- +full:repo +full:- +full:branch
4 …adge/github-dtolnay/unicode--ident-8da0cb?style=for-the-badge&labelColor=555555&logo=github" heigh…
5 …ields.io/crates/v/unicode-ident.svg?style=for-the-badge&color=fc8d62&logo=rust" height="20">](http…
6 …lds.io/badge/docs.rs-unicode--ident-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs" hei…
7 …tolnay/unicode-ident/ci.yml?branch=master&style=for-the-badge" height="20">](https://github.com/dt…
14 This crate is a better optimized implementation of the older `unicode-xid`
15 crate. This crate uses less static storage, and is able to classify both ASCII
16 and non-ASCII codepoints with better performance, 2–10× faster than
17 `unicode-xid`.
26 - `unicode-ident` is this crate;
27 - [`unicode-xid`] is a widely used crate run by the "unicode-rs" org;
28 - `ucd-trie` and `fst` are two data structures supported by the [`ucd-generate`] tool;
29 - [`roaring`] is a Rust implementation of Roaring bitmap.
31 The *static storage* column shows the total size of `static` tables that the
36 different ratios of ASCII to non-ASCII codepoints in the input data.
38 [`unicode-xid`]: https://github.com/unicode-rs/unicode-xid
39 [`ucd-generate`]: https://github.com/BurntSushi/ucd-generate
40 [`roaring`]: https://github.com/RoaringBitmap/roaring-rs
42 | | static storage | 0% nonascii | 1% | 10% | 100% nonascii |
43 |---|---|---|---|---|---|
44 | **`unicode-ident`** | 10.0 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns |
45 | **`unicode-xid`** | 11.5 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns |
46 | **`ucd-trie`** | 10.2 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns |
50 Source code for the benchmark is provided in the *bench* directory of this repo
57 #### unicode-xid
64 ('\u{30}', '\u{39}'), // 0-9
65 ('\u{41}', '\u{5a}'), // A-Z
71 The static storage used by this data structure scales with the number of
73 consumes 8 bytes, because it consists of a pair of 32-bit `char` values.
79 surrounded by non-identifier codepoints consumes 64 bits in the table, while it
82 On a system with 64-byte cache lines, binary searching the table touches 7 cache
85 the branch predictor.
87 Overall, the crate ends up being about 10× slower on non-ASCII input
91 Rust's `char` type is a 21-bit integer padded to 32 bits, which means every
100 #### ucd-trie
103 codepoints. The design is credited to Raph Levien in [rust-lang/rust#33098].
105 [rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098
120 a 64-bit integer. Each bit position of the integer corresponds to whether a
130 partitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded
131 codepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints,
139 query based on a UTF-8 encoded string, returning the Unicode property
141 caller is required to tokenize their UTF-8 encoded input data into `char`, hand
142 the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by converting
143 back into the variable-length representation for trie traversal.
148 [ucd-generate] but I am not aware of any advantage over the `ucd-trie`
149 representation. In particular `ucd-trie` is optimized for storing Unicode
153 [ucd-generate]: https://github.com/BurntSushi/ucd-generate
156 slow lookups for this use case relative to `ucd-trie` is that it does not
163 This crate is a pure-Rust implementation of [Roaring Bitmap], a data structure
164 designed for storing sets of 32-bit unsigned integers.
173 substantially slower than the Unicode-optimized crates. Meanwhile the
174 compression was significantly worse, requiring 6× as much storage for the
179 15% slower than pure-Rust `roaring`, which could just be FFI overhead. I did not
184 #### unicode-ident
186 This crate is most similar to the `ucd-trie` library, in that it's based on
192 - Uses a single 2-level trie, rather than 3 disjoint partitions of different
194 - Uses significantly larger chunks: 512 bits rather than 64 bits.
195 - Compresses the XID\_Start and XID\_Continue properties together
200 properties in uncompressed form, in row-major order:
205 …rt bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-a…
206 …ue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2…
214 This crate stores one 512-bit "row" of the above bitmaps in the leaf level of a
216 are 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient to
225 In fact since there are only 124 unique chunks, we can use an 8-bit index with a
226 spare bit to index at the half-chunk level. This achieves an additional 8.5%
232 In contrast to binary search or the `ucd-trie` crate, performing lookups in this
233 data structure is straight-line code with no need for branching.
267 href="LICENSE-UNICODE">Unicode License Agreement – Data Files and Software
272 href="LICENSE-APACHE">Apache License, Version 2.0</a> or <a
273 href="LICENSE-MIT">MIT license</a> at your option.
282 for inclusion in this crate by you, as defined in the Apache-2.0 license, shall