Name |
Date |
Size |
#Lines |
LOC |
||
---|---|---|---|---|---|---|
.. | - | - | ||||
README.md | D | 12-May-2024 | 4.3 KiB | 93 | 78 | |
array-mapped-radix-tree.js | D | 12-May-2024 | 2.4 KiB | 101 | 72 | |
index.js | D | 12-May-2024 | 1 KiB | 26 | 19 | |
radix-tree.js | D | 12-May-2024 | 5.1 KiB | 196 | 106 | |
tree-node.js | D | 12-May-2024 | 692 | 38 | 29 |
README.md
1# Named entity array mapped radix tree generator 2 3Prior to `v3.0.0` we've used simple pre-generated [trie data structure](https://en.wikipedia.org/wiki/Trie) 4for [named character reference](https://html.spec.whatwg.org/multipage/syntax.html#named-character-references) consumption 5in tokenizer. This approach suffered from huge constant memory consumption: the in-memory size of the structure was ~8.5Mb. 6This new approach reduces the size of the character reference data to ~250Kb and maintains performance that is close 7to trie structure. 8 9## Radix tree 10Trie was replaced with [radix tree](https://en.wikipedia.org/wiki/Radix_tree). Unlike trie, which contains 11only *nodes*, radix tree contains *nodes* and *edges*. If subsequent nodes contains only one branch they can be combined into single 12edge. 13 14E.g. for words `test`, `tester` and `testing` we'll have trie (legend: `[a, ...]` - node, `<abc>` - edge, `*` - data): 15``` 16 [t] 17 | 18 [e] 19 | 20 [s] 21 | 22 [t] 23 | 24 [e, i, *] 25 / | 26 [r] [n] 27 | | 28 [*] [g] 29 | 30 [*] 31``` 32 33with radix tree we can reduce it to: 34``` 35 <test> 36 | 37 [e, i, *] 38 / | 39 <r> <ng> 40 | | 41 [*] [*] 42``` 43 44This approach has two advantages: 45- it significantly reduces number of nodes, and thus memory allocated for data strucure; 46- edge can be represented by simple array. 47 48## Mapping radix tree to array 49 50We've significantly reduced the size of the tree. However, since we need to allocate an object for each node and array for 51each edge, it still consumes a lot of memory. Therefore, we map our tree to array, so we'll end up with just a single object. 52From my experiments it's appeared that it's possible to map this tree to `Uint16Array`, since we don't have indices and 53code points which are more than `MAX_UINT16` (which is `0xFFFF`). The only exception here is [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF) 54which appear in named character reference results, but they can be decoupled to two `uint16` code points. The advantage 55of typed arrays is what they obviously consume less memory and extremely fast to traverse. 56 57Since edges are already arrays, we write them to array as is. 58 59### Mapping nodes 60 61#### Node header 62Node may contain `1` or `2` bytes of data and/or branch data. Branch data is represented by dictionary, keys of 63dictionary are code points and the values are references to the next node or edge. Since edges represented as raw arrays 64of code points, we need marker which will tell traversing algorithm that it sought a node. Character reference names contains only 65ASCII alpha characters and semicolon. So, any value less than `59` (semicolon code point) will work for us. The naive 66approach is to use, e.g. `0` as the node marker, followed by number of data bytes, followed by data, and then followed by number of branches: 67``` 68| 0 | 2 | 8847, 824 | 5 | ... | 69 \ \ \ \ \ 70 \ \ \ \ branch data 71 \ \ \ number of branches 72 \ \ data code points 73 \ number of data code points 74 node marker 75``` 76 77We can improve this structure by using binary flags to encode information about branch into marker. We have three flags: 78- `HAS_DATA_FLAG` - node has data; 79- `DATA_DUPLET_FLAG` - node has 2 bytes of data; 80- `HAS_BRANCHES_FLAG` - node has branches. 81 82With flags we can omit some parts of the node structure, thus reducing the in-memory size of the node. 83 84#### Branch data 85 86Branch data is represented by two arrays, following one after another. First array contains sorted transition code points, 87and the second one - corresponding next edge/node indices. Traversing algorithm reads number of branches in node header 88and then performs binary search for the matching code point. When matching code point found, next node/edge index can be 89obtained on offset which is equal to the number of branches in the node. Binary search can look like a step back compared 90to previous approach where branch lookup was implemented as dictionary lookup which is `O(1)`. However, since character 91reference names consists of ASCII alpha characters and semicolon we have `log(26*2+1) ≈ 6` iterations of search loop 92in worst case. Iteration other the typed array is extremely fast, so performance doesn't degrade here. 93