Name |
Date |
Size |
#Lines |
LOC |
||
---|---|---|---|---|---|---|
.. | - | - | ||||
.github/ | 12-May-2024 | - | 92 | 81 | ||
benches/ | 12-May-2024 | - | 125 | 99 | ||
diagram/ | 12-May-2024 | - | 41 | 35 | ||
generate/ | 12-May-2024 | - | 370 | 317 | ||
src/ | 12-May-2024 | - | 918 | 656 | ||
tests/ | 12-May-2024 | - | 1,006 | 943 | ||
.gitattributes | D | 12-May-2024 | 202 | 6 | 5 | |
.gitignore | D | 12-May-2024 | 34 | 5 | 4 | |
BUILD.gn | D | 12-May-2024 | 1 KiB | 28 | 24 | |
Cargo.toml | D | 12-May-2024 | 954 | 34 | 28 | |
LICENSE-APACHE | D | 12-May-2024 | 9.5 KiB | 177 | 150 | |
LICENSE-MIT | D | 12-May-2024 | 1,023 | 24 | 21 | |
LICENSE-UNICODE | D | 12-May-2024 | 2.3 KiB | 47 | 39 | |
OAT.xml | D | 12-May-2024 | 4.5 KiB | 72 | 18 | |
README.OpenSource | D | 12-May-2024 | 398 | 11 | 11 | |
README.md | D | 12-May-2024 | 12.2 KiB | 284 | 223 |
README.OpenSource
1[ 2 { 3 "Name": "unicode-ident", 4 "License": "Unicode, Apache License 2.0, MIT", 5 "License File": "LICENSE-UNICODE|LICENSE-APACHE|LICENSE-MIT", 6 "Version Number": "1.0.8", 7 "Owner": "fangting12@huawei.com", 8 "Upstream URL": "https://github.com/dtolnay/unicode-ident.git", 9 "Description": "A Rust library that provides support for working with Unicode identifiers." 10} 11]
README.md
1Unicode ident 2============= 3 4[<img alt="github" src="https://img.shields.io/badge/github-dtolnay/unicode--ident-8da0cb?style=for-the-badge&labelColor=555555&logo=github" height="20">](https://github.com/dtolnay/unicode-ident) 5[<img alt="crates.io" src="https://img.shields.io/crates/v/unicode-ident.svg?style=for-the-badge&color=fc8d62&logo=rust" height="20">](https://crates.io/crates/unicode-ident) 6[<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-unicode--ident-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs" height="20">](https://docs.rs/unicode-ident) 7[<img alt="build status" src="https://img.shields.io/github/actions/workflow/status/dtolnay/unicode-ident/ci.yml?branch=master&style=for-the-badge" height="20">](https://github.com/dtolnay/unicode-ident/actions?query=branch%3Amaster) 8 9Implementation of [Unicode Standard Annex #31][tr31] for determining which 10`char` values are valid in programming language identifiers. 11 12[tr31]: https://www.unicode.org/reports/tr31/ 13 14This crate is a better optimized implementation of the older `unicode-xid` 15crate. This crate uses less static storage, and is able to classify both ASCII 16and non-ASCII codepoints with better performance, 2–10× faster than 17`unicode-xid`. 18 19<br> 20 21## Comparison of performance 22 23The following table shows a comparison between five Unicode identifier 24implementations. 25 26- `unicode-ident` is this crate; 27- [`unicode-xid`] is a widely used crate run by the "unicode-rs" org; 28- `ucd-trie` and `fst` are two data structures supported by the [`ucd-generate`] tool; 29- [`roaring`] is a Rust implementation of Roaring bitmap. 30 31The *static storage* column shows the total size of `static` tables that the 32crate bakes into your binary, measured in 1000s of bytes. 33 34The remaining columns show the **cost per call** to evaluate whether a single 35`char` has the XID\_Start or XID\_Continue Unicode property, comparing across 36different ratios of ASCII to non-ASCII codepoints in the input data. 37 38[`unicode-xid`]: https://github.com/unicode-rs/unicode-xid 39[`ucd-generate`]: https://github.com/BurntSushi/ucd-generate 40[`roaring`]: https://github.com/RoaringBitmap/roaring-rs 41 42| | static storage | 0% nonascii | 1% | 10% | 100% nonascii | 43|---|---|---|---|---|---| 44| **`unicode-ident`** | 10.0 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns | 45| **`unicode-xid`** | 11.5 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns | 46| **`ucd-trie`** | 10.2 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns | 47| **`fst`** | 138 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns | 48| **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns | 49 50Source code for the benchmark is provided in the *bench* directory of this repo 51and may be repeated by running `cargo criterion`. 52 53<br> 54 55## Comparison of data structures 56 57#### unicode-xid 58 59They use a sorted array of character ranges, and do a binary search to look up 60whether a given character lands inside one of those ranges. 61 62```rust 63static XID_Continue_table: [(char, char); 763] = [ 64 ('\u{30}', '\u{39}'), // 0-9 65 ('\u{41}', '\u{5a}'), // A-Z 66 … 67 ('\u{e0100}', '\u{e01ef}'), 68]; 69``` 70 71The static storage used by this data structure scales with the number of 72contiguous ranges of identifier codepoints in Unicode. Every table entry 73consumes 8 bytes, because it consists of a pair of 32-bit `char` values. 74 75In some ranges of the Unicode codepoint space, this is quite a sparse 76representation – there are some ranges where tens of thousands of adjacent 77codepoints are all valid identifier characters. In other places, the 78representation is quite inefficient. A characater like `µ` (U+00B5) which is 79surrounded by non-identifier codepoints consumes 64 bits in the table, while it 80would be just 1 bit in a dense bitmap. 81 82On a system with 64-byte cache lines, binary searching the table touches 7 cache 83lines on average. Each cache line fits only 8 table entries. Additionally, the 84branching performed during the binary search is probably mostly unpredictable to 85the branch predictor. 86 87Overall, the crate ends up being about 10× slower on non-ASCII input 88compared to the fastest crate. 89 90A potential improvement would be to pack the table entries more compactly. 91Rust's `char` type is a 21-bit integer padded to 32 bits, which means every 92table entry is holding 22 bits of wasted space, adding up to 3.9 K. They could 93instead fit every table entry into 6 bytes, leaving out some of the padding, for 94a 25% improvement in space used. With some cleverness it may be possible to fit 95in 5 bytes or even 4 bytes by storing a low char and an extent, instead of low 96char and high char. I don't expect that performance would improve much but this 97could be the most efficient for space across all the libraries, needing only 98about 7 K to store. 99 100#### ucd-trie 101 102Their data structure is a compressed trie set specifically tailored for Unicode 103codepoints. The design is credited to Raph Levien in [rust-lang/rust#33098]. 104 105[rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098 106 107```rust 108pub struct TrieSet { 109 tree1_level1: &'static [u64; 32], 110 tree2_level1: &'static [u8; 992], 111 tree2_level2: &'static [u64], 112 tree3_level1: &'static [u8; 256], 113 tree3_level2: &'static [u8], 114 tree3_level3: &'static [u64], 115} 116``` 117 118It represents codepoint sets using a trie to achieve prefix compression. The 119final states of the trie are embedded in leaves or "chunks", where each chunk is 120a 64-bit integer. Each bit position of the integer corresponds to whether a 121particular codepoint is in the set or not. These chunks are not just a compact 122representation of the final states of the trie, but are also a form of suffix 123compression. In particular, if multiple ranges of 64 contiguous codepoints have 124the same Unicode properties, then they all map to the same chunk in the final 125level of the trie. 126 127Being tailored for Unicode codepoints, this trie is partitioned into three 128disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints \[0, 1290x800), the second \[0x800, 0x10000) and the third \[0x10000, 0x110000). These 130partitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded 131codepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints, 132respectively. 133 134Lookups in this data structure are significantly more efficient than binary 135search. A lookup touches either 1, 2, or 3 cache lines based on which of the 136trie partitions is being accessed. 137 138One possible performance improvement would be for this crate to expose a way to 139query based on a UTF-8 encoded string, returning the Unicode property 140corresponding to the first character in the string. Without such an API, the 141caller is required to tokenize their UTF-8 encoded input data into `char`, hand 142the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by converting 143back into the variable-length representation for trie traversal. 144 145#### fst 146 147Uses a [finite state transducer][fst]. This representation is built into 148[ucd-generate] but I am not aware of any advantage over the `ucd-trie` 149representation. In particular `ucd-trie` is optimized for storing Unicode 150properties while `fst` is not. 151 152[fst]: https://github.com/BurntSushi/fst 153[ucd-generate]: https://github.com/BurntSushi/ucd-generate 154 155As far as I can tell, the main thing that causes `fst` to have large size and 156slow lookups for this use case relative to `ucd-trie` is that it does not 157specialize for the fact that only 21 of the 32 bits in a `char` are meaningful. 158There are some dense arrays in the structure with large ranges that could never 159possibly be used. 160 161#### roaring 162 163This crate is a pure-Rust implementation of [Roaring Bitmap], a data structure 164designed for storing sets of 32-bit unsigned integers. 165 166[Roaring Bitmap]: https://roaringbitmap.org/about/ 167 168Roaring bitmaps are compressed bitmaps which tend to outperform conventional 169compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be 170hundreds of times faster and they often offer significantly better compression. 171 172In this use case the performance was reasonably competitive but still 173substantially slower than the Unicode-optimized crates. Meanwhile the 174compression was significantly worse, requiring 6× as much storage for the 175data structure. 176 177I also benchmarked the [`croaring`] crate which is an FFI wrapper around the C 178reference implementation of Roaring Bitmap. This crate was consistently about 17915% slower than pure-Rust `roaring`, which could just be FFI overhead. I did not 180investigate further. 181 182[`croaring`]: https://crates.io/crates/croaring 183 184#### unicode-ident 185 186This crate is most similar to the `ucd-trie` library, in that it's based on 187bitmaps stored in the leafs of a trie representation, achieving both prefix 188compression and suffix compression. 189 190The key differences are: 191 192- Uses a single 2-level trie, rather than 3 disjoint partitions of different 193 depth each. 194- Uses significantly larger chunks: 512 bits rather than 64 bits. 195- Compresses the XID\_Start and XID\_Continue properties together 196 simultaneously, rather than duplicating identical trie leaf chunks across the 197 two. 198 199The following diagram show the XID\_Start and XID\_Continue Unicode boolean 200properties in uncompressed form, in row-major order: 201 202<table> 203<tr><th>XID_Start</th><th>XID_Continue</th></tr> 204<tr> 205<td><img alt="XID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td> 206<td><img alt="XID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td> 207</tr> 208</table> 209 210Uncompressed, these would take 140 K to store, which is beyond what would be 211reasonable. However, as you can see there is a large degree of similarity 212between the two bitmaps and across the rows, which lends well to compression. 213 214This crate stores one 512-bit "row" of the above bitmaps in the leaf level of a 215trie, and a single additional level to index into the leafs. It turns out there 216are 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient to 217index them. 218 219The chunk size of 512 bits is selected as the size that minimizes the total size 220of the data structure. A smaller chunk, like 256 or 128 bits, would achieve 221better deduplication but require a larger index. A larger chunk would increase 222redundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size of 223the index plus leaf bitmaps. 224 225In fact since there are only 124 unique chunks, we can use an 8-bit index with a 226spare bit to index at the half-chunk level. This achieves an additional 8.5% 227compression by eliminating redundancies between the second half of any chunk and 228the first half of any other chunk. Note that this is not the same as using 229chunks which are half the size, because it does not necessitate raising the size 230of the trie's first level. 231 232In contrast to binary search or the `ucd-trie` crate, performing lookups in this 233data structure is straight-line code with no need for branching. 234 235```asm 236is_xid_start: 237 mov eax, edi 238 shr eax, 9 239 lea rcx, [rip + unicode_ident::tables::TRIE_START] 240 add rcx, rax 241 xor eax, eax 242 cmp edi, 201728 243 cmovb rax, rcx 244 test rax, rax 245 lea rcx, [rip + .L__unnamed_1] 246 cmovne rcx, rax 247 movzx eax, byte ptr [rcx] 248 shl rax, 5 249 mov ecx, edi 250 shr ecx, 3 251 and ecx, 63 252 add rcx, rax 253 lea rax, [rip + unicode_ident::tables::LEAF] 254 mov al, byte ptr [rax + rcx] 255 and dil, 7 256 mov ecx, edi 257 shr al, cl 258 and al, 1 259 ret 260``` 261 262<br> 263 264## License 265 266Use of the Unicode Character Database, as this crate does, is governed by the <a 267href="LICENSE-UNICODE">Unicode License Agreement – Data Files and Software 268(2016)</a>. 269 270All intellectual property within this crate that is **not generated** using the 271Unicode Character Database as input is licensed under either of <a 272href="LICENSE-APACHE">Apache License, Version 2.0</a> or <a 273href="LICENSE-MIT">MIT license</a> at your option. 274 275The **generated** files incorporate tabular data derived from the Unicode 276Character Database, together with intellectual property from the original source 277code content of the crate. One must comply with the terms of both the Unicode 278License Agreement and either of the Apache license or MIT license when those 279generated files are involved. 280 281Unless you explicitly state otherwise, any contribution intentionally submitted 282for inclusion in this crate by you, as defined in the Apache-2.0 license, shall 283be licensed as just described, without any additional terms or conditions. 284