• Home
Name
Date
Size
#Lines
LOC

..--

README.mdD12-May-20244.3 KiB9378

array-mapped-radix-tree.jsD12-May-20242.4 KiB10172

index.jsD12-May-20241 KiB2619

radix-tree.jsD12-May-20245.1 KiB196106

tree-node.jsD12-May-2024692 3829

README.md

1# Named entity array mapped radix tree generator
2
3Prior to `v3.0.0` we've used simple pre-generated [trie data structure](https://en.wikipedia.org/wiki/Trie)
4for [named character reference](https://html.spec.whatwg.org/multipage/syntax.html#named-character-references) consumption
5in tokenizer. This approach suffered from huge constant memory consumption: the in-memory size of the structure was ~8.5Mb.
6This new approach reduces the size of the character reference data to ~250Kb and maintains performance that is close
7to trie structure.
8
9## Radix tree
10Trie was replaced with [radix tree](https://en.wikipedia.org/wiki/Radix_tree). Unlike trie, which contains
11only *nodes*, radix tree contains *nodes* and *edges*. If subsequent nodes contains only one branch they can be combined into single
12edge.
13
14E.g. for words `test`, `tester` and `testing` we'll have trie (legend: `[a, ...]` - node, `<abc>` - edge, `*` - data):
15```
16              [t]
17               |
18              [e]
19               |
20              [s]
21               |
22              [t]
23               |
24           [e, i, *]
25           /   |
26         [r]  [n]
27          |    |
28         [*]  [g]
29               |
30              [*]
31```
32
33with radix tree we can reduce it to:
34```
35        <test>
36          |
37      [e, i, *]
38      /   |
39    <r>  <ng>
40     |     |
41    [*]   [*]
42```
43
44This approach has two advantages:
45- it significantly reduces number of nodes, and thus memory allocated for data strucure;
46- edge can be represented by simple array.
47
48## Mapping radix tree to array
49
50We've significantly reduced the size of the tree. However, since we need to allocate an object for each node and array for
51each edge, it still consumes a lot of memory. Therefore, we map our tree to array, so we'll end up with just a single object.
52From my experiments it's appeared that it's possible to map this tree to `Uint16Array`, since we don't have indices and
53code points which are more than `MAX_UINT16` (which is `0xFFFF`). The only exception here is [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF)
54which appear in named character reference results, but they can be decoupled to two `uint16` code points. The advantage
55of typed arrays is what they obviously consume less memory and extremely fast to traverse.
56
57Since edges are already arrays, we write them to array as is.
58
59### Mapping nodes
60
61#### Node header
62Node may contain `1` or `2` bytes of data and/or branch data. Branch data is represented by dictionary, keys of
63dictionary are code points and the values are references to the next node or edge. Since edges represented as raw arrays
64of code points, we need marker which will tell traversing algorithm that it sought a node. Character reference names contains only
65ASCII alpha characters and semicolon. So, any value less than `59` (semicolon code point) will work for us. The naive
66approach is to use, e.g. `0` as the node marker, followed by number of data bytes, followed by data, and then followed by number of branches:
67```
68| 0 | 2 | 8847, 824 | 5 | ... |
69 \    \    \          \   \
70  \    \    \          \   branch data
71   \    \    \          number of branches
72    \    \    data code points
73     \    number of data code points
74      node marker
75```
76
77We can improve this structure by using binary flags to encode information about branch into marker. We have three flags:
78- `HAS_DATA_FLAG` - node has data;
79- `DATA_DUPLET_FLAG` - node has 2 bytes of data;
80- `HAS_BRANCHES_FLAG` - node has branches.
81
82With flags we can omit some parts of the node structure, thus reducing the in-memory size of the node.
83
84#### Branch data
85
86Branch data is represented by two arrays, following one after another. First array contains sorted transition code points,
87and the second one - corresponding next edge/node indices. Traversing algorithm reads number of branches in node header
88and then performs binary search for the matching code point. When matching code point found, next node/edge index can be
89obtained on offset which is equal to the number of branches in the node. Binary search can look like a step back compared
90to previous approach where branch lookup was implemented as dictionary lookup which is `O(1)`. However, since character
91reference names consists of ASCII alpha characters and semicolon we have `log(26*2+1) ≈ 6` iterations of search loop
92in worst case. Iteration other the typed array is extremely fast, so performance doesn't degrade here.
93