• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1 // This file is part of ICU4X. For terms of use, please see the file
2 // called LICENSE at the top level of the ICU4X source tree
3 // (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).
4 
5 //! # ZeroTrie Builder
6 //!
7 //! There are two implementations of the ZeroTrie Builder:
8 //!
9 //! - [konst::ZeroTrieBuilderConst] allows for human-readable const construction
10 //! - [nonconst::ZeroTrieBuilder] has the full feaure set but requires `alloc`
11 //!
12 //! The two builders follow the same algorithm but have different capabilities.
13 //!
14 //! ## Builder Algorithm Overview
15 //!
16 //! The tries are built backwards, from the last node to the first node. The key step of the
17 //! algorithm is **determining what is the next node to prepend.**
18 //!
19 //! In the simple case of [`ZeroTrieSimpleAscii`], all nodes are binary-search, so if the input
20 //! strings are provided in lexicographic order, there is a simple, deterministic method for
21 //! identifying the next node. This insight is what enables us to make the const builder.
22 //!
23 //! The builder works with the following intermediate state variables:
24 //!
25 //! - `prefix_len` indicates the byte index we are currently processing.
26 //! - `i` and `j` bracket a window of strings in the input that share the same prefix.
27 //! - `current_len` is the length in bytes of the current self-contained trie.
28 //! - `lengths_stack` contains metadata for branch nodes.
29 //!
30 //! What follows is a verbal explanation of the build steps for a trie containing:
31 //!
32 //! - "" → 11
33 //! - "ad" → 22
34 //! - "adef" → 33
35 //! - "adghk" → 44
36 //!
37 //! When a node is prepended, it is shown in **boldface**.
38 //!
39 //! 1. Initialize the builder by setting `i=3`, `j=4`, `prefix_len=5` (the last string),
40 //!    `current_len=0`, and `lengths_stack` empty. Start the main loop.
41 //! 2. Top of loop. The string at `i` is equal in length to `prefix_len`, so we prepend
42 //!    our first node: a **value node 44**, which requires a 2-byte varint. Increase
43 //!    `current_len` to 2.
44 //! 3. Reduce `prefix_len` to 4, read our `key_ascii="k"`, and recalculate `i` and `j`
45 //!    _(this calculation is a long chunk of code in the builder impls)_. Since there is no
46 //!    other string with the prefix "adgh", `i` and `j` stay the same, we prepend an
47 //!    **ASCII node "k"**, increase `current_len` to 3, and continue the main loop.
48 //! 4. Top of loop. The string at `i` is of length 5, but `prefix_len` is 4, so there is
49 //!    no value node to prepend.
50 //! 5. Reduce `prefix_len` to 3, read our `key_ascii="h"`, and recalculate `i` and `j`.
51 //!    There are no other strings sharing the prefix "abg", so we prepend an
52 //!    **ASCII node "h"**, increase `current_len` to 4, and continue the main loop.
53 //! 6. Top of loop. There is still no value node to prepend.
54 //! 7. Reduce `prefix_len` to 2, read our `key_ascii="g"`, and recalculate `i` and `j`.
55 //!    We find that `i=1` and `j=4`, the range of strings sharing the prefix "ad". Since
56 //!    `i` or `j` changed, proceed to evaluate the branch node.
57 //! 8. The last branch byte `ascii_j` for this prefix is "g", which is the same as `key_ascii`,
58 //!    so we are the _last_ target of a branch node. Push an entry onto `lengths_stack`:
59 //!    `BranchMeta { ascii: "g", cumulative_length: 4, local_length: 4, count: 1 }`.
60 //! 9. The first branch byte `ascii_i` for this prefix is "e", which is NOT equal to `key_ascii`,
61 //!    so we are _not the first_ target of a branch node. We therefore start evaluating the
62 //!    string preceding where we were at the top of the current loop. We set `i=2`, `j=3`,
63 //!    `prefix_len=4` (length of the string at `i`), and continue the main loop.
64 //! 10. Top of loop. Since the string at `i` is equal in length to `prefix_len`, we prepend a
65 //!     **value node 33** (which requires a 2-byte varint) and increase `current_len` to 2.
66 //! 11. Reduce `prefix_len` to 3, read our `key_ascii="f"`, and recalculate `i` and `j`.
67 //!     They stay the same, so we prepend an **ASCII node "f"**, increase `current_len` to 3,
68 //!     and continue the main loop.
69 //! 12. Top of loop. No value node this time.
70 //! 13. Reduce `prefix_len` to 2, read our `key_ascii="e"`, and recalculate `i` and `j`.
71 //!     They go back to `i=1` and `j=4`.
72 //! 14. The last branch byte `ascii_j` for this prefix is "g", which is NOT equal to `key_ascii`,
73 //!     so we are _not the last_ target of a branch node. We peek at the entry at the front of
74 //!     the lengths stack and use it to push another entry onto the stack:
75 //!     `BranchMeta { ascii: "e", cumulative_length: 7, local_length: 3, count: 2 }`
76 //! 15. The first branch byte `ascii_i` for this prefix is "e", which is the same as `key_ascii`,
77 //!     wo we are the _first_ target of a branch node. We can therefore proceed to prepend the
78 //!     metadata for the branch node. We peek at the top of the stack and find that there are 2
79 //!     tries reachable from this branch and they have a total byte length of 5. We then pull off
80 //!     2 entries from the stack into a local variable `branch_metas`. From here, we write out
81 //!     the **offset table**, **lookup table**, and **branch head node**, which are determined
82 //!     from the metadata entries. We set `current_len` to the length of the two tries plus the
83 //!     metadata, which happens to be 11. Then we return to the top of the main loop.
84 //! 16. Top of loop. The string at `i` is length 2, which is the same as `prefix_len`, so we
85 //!     prepend a **value node 22** (2-byte varint) and increase `current_len` to 13.
86 //! 17. Reduce `prefix_len` to 1, read our `key_ascii="d"`, and recalculate `i` and `j`.
87 //!     They stay the same, so we prepend an **ASCII node "d"**, increase `current_len` to 14,
88 //!     and continue the main loop.
89 //! 18. Top of loop. No value node this time.
90 //! 19. Reduce `prefix_len` to 0, read our `key_ascii="a"`, and recalculate `i` and `j`.
91 //!     They change to `i=0` and `j=4`, since all strings have the empty string as a prefix.
92 //!     However, `ascii_i` and `ascii_j` both equal `key_ascii`, so we prepend **ASCII node "a"**,
93 //!     increase `current_len` to 15, and continue the main loop.
94 //! 16. Top of loop. The string at `i` is length 0, which is the same as `prefix_len`, so we
95 //!     prepend a **value node 11** and increase `current_len` to 16.
96 //! 17. We can no longer reduce `prefix_len`, so our trie is complete.
97 //!
98 //! ## Perfect Hash Reordering
99 //!
100 //! When the PHF is added to the mix, the main change is that the strings are no longer in sorted
101 //! order when they are in the trie. To resolve this issue, when adding a branch node, the target
102 //! tries are rearranged in-place in the buffer to be in the correct order for the PHF.
103 //!
104 //! ## Example
105 //!
106 //! Here is the output of the trie described above.
107 //!
108 //! ```
109 //! use zerotrie::ZeroTrieSimpleAscii;
110 //!
111 //! const DATA: [(&str, usize); 4] =
112 //!     [("", 11), ("ad", 22), ("adef", 33), ("adghk", 44)];
113 //!
114 //! // As demonstrated above, the required capacity for this trie is 16 bytes
115 //! const TRIE: ZeroTrieSimpleAscii<[u8; 16]> =
116 //!     ZeroTrieSimpleAscii::from_sorted_str_tuples(&DATA);
117 //!
118 //! assert_eq!(
119 //!     TRIE.as_bytes(),
120 //!     &[
121 //!         0x8B, // value node 11
122 //!         b'a', // ASCII node 'a'
123 //!         b'd', // ASCII node 'd'
124 //!         0x90, // value node 22 lead byte
125 //!         0x06, // value node 22 trail byte
126 //!         0xC2, // branch node 2
127 //!         b'e', // first target of branch
128 //!         b'g', // second target of branch
129 //!         3,    // offset
130 //!         b'f', // ASCII node 'f'
131 //!         0x90, // value node 33 lead byte
132 //!         0x11, // value node 33 trail byte
133 //!         b'h', // ASCII node 'h'
134 //!         b'k', // ASCII node 'k'
135 //!         0x90, // value node 44 lead byte
136 //!         0x1C, // value node 44 trail byte
137 //!     ]
138 //! );
139 //!
140 //! assert_eq!(TRIE.get(b""), Some(11));
141 //! assert_eq!(TRIE.get(b"ad"), Some(22));
142 //! assert_eq!(TRIE.get(b"adef"), Some(33));
143 //! assert_eq!(TRIE.get(b"adghk"), Some(44));
144 //! assert_eq!(TRIE.get(b"unknown"), None);
145 //! ```
146 
147 mod branch_meta;
148 pub(crate) mod bytestr;
149 pub(crate) mod konst;
150 #[cfg(feature = "litemap")]
151 mod litemap;
152 #[cfg(feature = "alloc")]
153 pub(crate) mod nonconst;
154 
155 use bytestr::ByteStr;
156 
157 use super::ZeroTrieSimpleAscii;
158 
159 impl<const N: usize> ZeroTrieSimpleAscii<[u8; N]> {
160     /// **Const Constructor:** Creates an [`ZeroTrieSimpleAscii`] from a sorted slice of keys and values.
161     ///
162     /// This function needs to know the exact length of the resulting trie at compile time. To
163     /// figure out `N`, first set `N` to be too large (say 0xFFFF), then look at the resulting
164     /// compile error which will tell you how to set `N`, like this:
165     ///
166     /// > the evaluated program panicked at 'Buffer too large. Size needed: 17'
167     ///
168     /// That error message says you need to set `N` to 17.
169     ///
170     /// Also see [`Self::from_sorted_str_tuples`].
171     ///
172     /// # Panics
173     ///
174     /// Panics if `items` is not sorted or if `N` is not correct.
175     ///
176     /// # Examples
177     ///
178     /// Create a `const` ZeroTrieSimpleAscii at compile time:
179     ///
180     /// ```
181     /// use zerotrie::ZeroTrieSimpleAscii;
182     ///
183     /// // The required capacity for this trie happens to be 17 bytes
184     /// const TRIE: ZeroTrieSimpleAscii<[u8; 17]> =
185     ///     ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
186     ///         (b"bar", 2),
187     ///         (b"bazzoo", 3),
188     ///         (b"foo", 1),
189     ///     ]);
190     ///
191     /// assert_eq!(TRIE.get(b"foo"), Some(1));
192     /// assert_eq!(TRIE.get(b"bar"), Some(2));
193     /// assert_eq!(TRIE.get(b"bazzoo"), Some(3));
194     /// assert_eq!(TRIE.get(b"unknown"), None);
195     /// ```
196     ///
197     /// Panics if strings are not sorted:
198     ///
199     /// ```compile_fail
200     /// # use zerotrie::ZeroTrieSimpleAscii;
201     /// const TRIE: ZeroTrieSimpleAscii<[u8; 17]> = ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
202     ///     (b"foo", 1),
203     ///     (b"bar", 2),
204     ///     (b"bazzoo", 3),
205     /// ]);
206     /// ```
207     ///
208     /// Panics if capacity is too small:
209     ///
210     /// ```compile_fail
211     /// # use zerotrie::ZeroTrieSimpleAscii;
212     /// const TRIE: ZeroTrieSimpleAscii<[u8; 15]> = ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
213     ///     (b"bar", 2),
214     ///     (b"bazzoo", 3),
215     ///     (b"foo", 1),
216     /// ]);
217     /// ```
218     ///
219     /// Panics if capacity is too large:
220     ///
221     /// ```compile_fail
222     /// # use zerotrie::ZeroTrieSimpleAscii;
223     /// const TRIE: ZeroTrieSimpleAscii<[u8; 20]> = ZeroTrieSimpleAscii::from_sorted_u8_tuples(&[
224     ///     (b"bar", 2),
225     ///     (b"bazzoo", 3),
226     ///     (b"foo", 1),
227     /// ]);
228     /// ```
from_sorted_u8_tuples(tuples: &[(&[u8], usize)]) -> Self229     pub const fn from_sorted_u8_tuples(tuples: &[(&[u8], usize)]) -> Self {
230         use konst::*;
231         let byte_str_slice = ByteStr::from_byte_slice_with_value(tuples);
232         let result = ZeroTrieBuilderConst::<N>::from_tuple_slice::<100>(byte_str_slice);
233         match result {
234             Ok(s) => Self::from_store(s.build_or_panic()),
235             Err(_) => panic!("Failed to build ZeroTrie"),
236         }
237     }
238 
239     /// **Const Constructor:** Creates an [`ZeroTrieSimpleAscii`] from a sorted slice of keys and values.
240     ///
241     /// This function needs to know the exact length of the resulting trie at compile time. To
242     /// figure out `N`, first set `N` to be too large (say 0xFFFF), then look at the resulting
243     /// compile error which will tell you how to set `N`, like this:
244     ///
245     /// > the evaluated program panicked at 'Buffer too large. Size needed: 17'
246     ///
247     /// That error message says you need to set `N` to 17.
248     ///
249     /// Also see [`Self::from_sorted_u8_tuples`].
250     ///
251     /// # Panics
252     ///
253     /// Panics if `items` is not sorted, if `N` is not correct, or if any of the strings contain
254     /// non-ASCII characters.
255     ///
256     /// # Examples
257     ///
258     /// Create a `const` ZeroTrieSimpleAscii at compile time:
259     ///
260     /// ```
261     /// use zerotrie::ZeroTrieSimpleAscii;
262     ///
263     /// // The required capacity for this trie happens to be 17 bytes
264     /// const TRIE: ZeroTrieSimpleAscii<[u8; 17]> =
265     ///     ZeroTrieSimpleAscii::from_sorted_str_tuples(&[
266     ///         ("bar", 2),
267     ///         ("bazzoo", 3),
268     ///         ("foo", 1),
269     ///     ]);
270     ///
271     /// assert_eq!(TRIE.get(b"foo"), Some(1));
272     /// assert_eq!(TRIE.get(b"bar"), Some(2));
273     /// assert_eq!(TRIE.get(b"bazzoo"), Some(3));
274     /// assert_eq!(TRIE.get(b"unknown"), None);
275     /// ```
276     ///
277     /// Panics if the strings are not ASCII:
278     ///
279     /// ```compile_fail
280     /// # use zerotrie::ZeroTrieSimpleAscii;
281     /// const TRIE: ZeroTrieSimpleAscii<[u8; 100]> = ZeroTrieSimpleAscii::from_sorted_str_tuples(&[
282     ///     ("bár", 2),
283     ///     ("båzzöo", 3),
284     ///     ("foo", 1),
285     /// ]);
286     /// ```
from_sorted_str_tuples(tuples: &[(&str, usize)]) -> Self287     pub const fn from_sorted_str_tuples(tuples: &[(&str, usize)]) -> Self {
288         use konst::*;
289         let byte_str_slice = ByteStr::from_str_slice_with_value(tuples);
290         // 100 is the value of `K`, the size of the lengths stack. If compile errors are
291         // encountered, this number may need to be increased.
292         let result = ZeroTrieBuilderConst::<N>::from_tuple_slice::<100>(byte_str_slice);
293         match result {
294             Ok(s) => Self::from_store(s.build_or_panic()),
295             Err(_) => panic!("Failed to build ZeroTrie"),
296         }
297     }
298 }
299