# Named entity array mapped radix tree generator Prior to `v3.0.0` we've used simple pre-generated [trie data structure](https://en.wikipedia.org/wiki/Trie) for [named character reference](https://html.spec.whatwg.org/multipage/syntax.html#named-character-references) consumption in tokenizer. This approach suffered from huge constant memory consumption: the in-memory size of the structure was ~8.5Mb. This new approach reduces the size of the character reference data to ~250Kb and maintains performance that is close to trie structure. ## Radix tree Trie was replaced with [radix tree](https://en.wikipedia.org/wiki/Radix_tree). Unlike trie, which contains only *nodes*, radix tree contains *nodes* and *edges*. If subsequent nodes contains only one branch they can be combined into single edge. E.g. for words `test`, `tester` and `testing` we'll have trie (legend: `[a, ...]` - node, `` - edge, `*` - data): ``` [t] | [e] | [s] | [t] | [e, i, *] / | [r] [n] | | [*] [g] | [*] ``` with radix tree we can reduce it to: ``` | [e, i, *] / | | | [*] [*] ``` This approach has two advantages: - it significantly reduces number of nodes, and thus memory allocated for data strucure; - edge can be represented by simple array. ## Mapping radix tree to array We've significantly reduced the size of the tree. However, since we need to allocate an object for each node and array for each edge, it still consumes a lot of memory. Therefore, we map our tree to array, so we'll end up with just a single object. From my experiments it's appeared that it's possible to map this tree to `Uint16Array`, since we don't have indices and code points which are more than `MAX_UINT16` (which is `0xFFFF`). The only exception here is [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF) which appear in named character reference results, but they can be decoupled to two `uint16` code points. The advantage of typed arrays is what they obviously consume less memory and extremely fast to traverse. Since edges are already arrays, we write them to array as is. ### Mapping nodes #### Node header Node may contain `1` or `2` bytes of data and/or branch data. Branch data is represented by dictionary, keys of dictionary are code points and the values are references to the next node or edge. Since edges represented as raw arrays of code points, we need marker which will tell traversing algorithm that it sought a node. Character reference names contains only ASCII alpha characters and semicolon. So, any value less than `59` (semicolon code point) will work for us. The naive approach is to use, e.g. `0` as the node marker, followed by number of data bytes, followed by data, and then followed by number of branches: ``` | 0 | 2 | 8847, 824 | 5 | ... | \ \ \ \ \ \ \ \ \ branch data \ \ \ number of branches \ \ data code points \ number of data code points node marker ``` We can improve this structure by using binary flags to encode information about branch into marker. We have three flags: - `HAS_DATA_FLAG` - node has data; - `DATA_DUPLET_FLAG` - node has 2 bytes of data; - `HAS_BRANCHES_FLAG` - node has branches. With flags we can omit some parts of the node structure, thus reducing the in-memory size of the node. #### Branch data Branch data is represented by two arrays, following one after another. First array contains sorted transition code points, and the second one - corresponding next edge/node indices. Traversing algorithm reads number of branches in node header and then performs binary search for the matching code point. When matching code point found, next node/edge index can be obtained on offset which is equal to the number of branches in the node. Binary search can look like a step back compared to previous approach where branch lookup was implemented as dictionary lookup which is `O(1)`. However, since character reference names consists of ASCII alpha characters and semicolon we have `log(26*2+1) ≈ 6` iterations of search loop in worst case. Iteration other the typed array is extremely fast, so performance doesn't degrade here.