README.md
1# LWS Full Text Search
2
3## Introduction
4
5![lwsac flow](/doc-assets/lws-fts.svg)
6
7The general approach is to scan one or more UTF-8 input text "files" (they may
8only exist in memory) and create an in-memory optimized trie for every token in
9the file.
10
11This can then be serialized out to disk in the form of a single index file (no
12matter how many input files were involved or how large they were).
13
14The implementation is designed to be modest on memory and cpu for both index
15creation and querying, and suitable for weak machines with some kind of random
16access storage. For searching only memory to hold results is required, the
17actual searches and autocomplete suggestions are done very rapidly by seeking
18around structures in the on-disk index file.
19
20Function|Related Link
21---|---
22Public API|[include/libwebsockets/lws-fts.h](https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-fts.h)
23CI test app|[minimal-examples/api-tests/api-test-fts](https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/api-tests/api-test-fts)
24Demo minimal example|[minimal-examples/http-server/minimal-http-server-fulltext-search](https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/http-server/minimal-http-server-fulltext-search)
25Live Demo|[https://libwebsockets.org/ftsdemo/](https://libwebsockets.org/ftsdemo/)
26
27## Query API overview
28
29Searching returns a potentially very large lwsac allocated object, with contents
30and max size controlled by the members of a struct lws_fts_search_params passed
31to the search function. Three kinds of result are possible:
32
33### Autocomplete suggestions
34
35These are useful to provide lists of extant results in
36realtime as the user types characters that constrain the search. So if the
37user has typed 'len', any hits for 'len' itself are reported along with
38'length', and whatever else is in the index beginning 'len'.. The results are
39selected using and are accompanied by an aggregated count of results down that
40path, and the results so the "most likely" results already measured by potential
41hits appear first.
42
43These results are in a linked-list headed by `result.autocomplete_head` and
44each is in a `struct lws_fts_result_autocomplete`.
45
46They're enabled in the search results by giving the flag
47 `LWSFTS_F_QUERY_AUTOCOMPLETE` in the search parameter flags.
48
49### Filepath results
50
51Simply a list of input files containing the search term with some statistics,
52one file is mentioned in a `struct lws_fts_result_filepath` result struct.
53
54This would be useful for creating a selection UI to "drill down" to individual
55files when there are many with matches.
56
57This is enabled by the `LWSFTS_F_QUERY_FILES` search flag.
58
59### Filepath and line results
60
61Same as the file path list, but for each filepath, information on the line
62numbers and input file offset where the line starts are provided.
63
64This is enabled by `LWSFTS_F_QUERY_FILE_LINES`... if you additionally give
65`LWSFTS_F_QUERY_QUOTE_LINE` flag then the contents of each hit line from the
66input file are also provided.
67
68## Result format inside the lwsac
69
70A `struct lws_fts_result` at the start of the lwsac contains heads for linked-
71lists of autocomplete and filepath results inside the lwsac.
72
73For autocomplete suggestions, the string itself is immediately after the
74`struct lws_fts_result_autocomplete` in memory. For filepath results, after
75each `struct lws_fts_result_filepath` is
76
77 - match information depending on the flags given to the search
78 - the filepath string
79
80You can always skip the line number table to get the filepath string by adding
81.matches_length to the address of the byte after the struct.
82
83The matches information is either
84
85 - 0 bytes per match
86
87 - 2x int32_t per match (8 bytes) if `LWSFTS_F_QUERY_FILE_LINES` given... the
88 first is the native-endian line number of the match, the second is the
89 byte offset in the original file where that line starts
90
91 - 2 x int32_t as above plus a const char * if `LWSFTS_F_QUERY_QUOTE_LINE` is
92 also given... this points to a NUL terminated string also stored in the
93 results lwsac that contains up to 255 chars of the line from the original
94 file. In some cases, the original file was either virtual (you are indexing
95 a git revision) or is not stored with the index, in that case you can't
96 usefully use `LWSFTS_F_QUERY_QUOTE_LINE`.
97
98To facilitate interpreting what is stored per match, the original search flags
99that created the result are stored in the `struct lws_fts_result`.
100
101## Indexing In-memory and serialized to file
102
103When creating the trie, in-memory structs are used with various optimization
104schemes trading off memory usage for speed. While in-memory, it's possible to
105add more indexed filepaths to the single index. Once the trie is complete in
106terms of having indexed everything, it is serialized to disk.
107
108These contain many additional housekeeping pointers and trie entries which can
109be optimized out. Most in-memory values must be held literally in large types,
110whereas most of the values in the serialized file use smaller VLI which use
111more or less bytes according to the value. So the peak memory requirements for
112large tries are much bigger than the size of the serialized trie file that is
113output.
114
115For the linux kernel at 4.14 and default indexing list on a 2.8GHz AMD
116threadripper (using one thread), the stats are:
117
118Name|Value
119---|---
120Files indexed|52932
121Input corpus size|694MiB
122Indexing cpu time|50.1s (>1000 files / sec; 13.8MBytes/sec)
123Peak alloc|78MiB
124Serialization time|202ms
125Trie File size|347MiB
126
127To index libwebsockets main branch under the same conditions:
128
129Name|Value
130---|---
131Files indexed|489
132Input corpus size|3MiB
133Indexing time|123ms
134Peak alloc|3MiB
135Serialization time|1ms
136Trie File size|1.4MiB
137
138
139Once it's generated, querying the trie file is very inexpensive, even when there
140are lots of results.
141
142 - trie entry child lists are kept sorted by the character they map to. This
143 allows discovering there is no match as soon as a character later in the
144 order than the one being matched is seen
145
146 - for the root trie, in addition to the linked-list child + sibling entries,
147 a 256-entry pointer table is associated with the root trie, allowing one-
148 step lookup. But as the table is 2KiB, it's too expensive to use on all
149 trie entries
150
151## Structure on disk
152
153All explicit multibyte numbers are stored in Network (MSB-first) byte order.
154
155 - file header
156 - filepath line number tables
157 - filepath information
158 - filepath map table
159 - tries, trie instances (hits), trie child tables
160
161### VLI coding
162
163VLI (Variable Length Integer) coding works like this
164
165[b7 EON] [b6 .. b0 DATA]
166
167If EON = 0, then DATA represents the Least-significant 7 bits of the number.
168if EON = 1, DATA represents More-significant 7-bits that should be shifted
169left until the byte with EON = 0 is found to terminate the number.
170
171The VLI used is predicated around 32-bit unsigned integers
172
173Examples:
174
175 - 0x30 = 48
176 - 0x81 30 = 176
177 - 0x81 0x80 0x00 = 16384
178
179Bytes | Range
180---|---
1811|<= 127
1822|<= 16K - 1
1833|<= 2M -1
1844|<= 256M - 1
1855|<= 4G - 1
186
187The coding is very efficient if there's a high probabilty the number being
188stored is not large. So it's great for line numbers for example, where most
189files have less that 16K lines and the VLI for the line number fits in 2 bytes,
190but if you meet a huge file, the VLI coding can also handle it.
191
192All numbers except a few in the headers that are actually written after the
193following data are stored using VLI for space- efficiency without limiting
194capability. The numbers that are fixed up after the fact have to have a fixed
195size and can't use VLI.
196
197### File header
198
199The first byte of the file header where the magic is, is "fileoffset" 0. All
200the stored "fileoffset"s are relative to that.
201
202The header has a fixed size of 16 bytes.
203
204size|function
205---|---
20632-bits|Magic 0xCA7A5F75
20732-bits|Fileoffset to root trie entry
20832-bits|Size of the trie file when it was created (to detect truncation)
20932-bits|Fileoffset to the filepath map
21032-bits|Number of filepaths
211
212### Filepath line tables
213
214Immediately after the file header are the line length tables.
215
216As the input files are parsed, line length tables are written for each file...
217at that time the rest of the parser data is held in memory so nothing else is
218in the file yet. These allow you to map logical line numbers in the file to
219file offsets space- and time- efficiently without having to walk through the
220file contents.
221
222The line information is cut into blocks, allowing quick skipping over the VLI
223data that doesn't contain the line you want just by following the 8-byte header
224part.
225
226Once you find the block with your line, you have to iteratively add the VLIs
227until you hit the one you want.
228
229For normal text files with average line length below 128, the VLIs will
230typically be a single byte. So a block of 200 line lengths is typically
231208 bytes long.
232
233There is a final linetable chunk consisting of all zeros to indicate the end
234of the filepath line chunk series for a filepath.
235
236size|function
237---|---
23816-bit|length of this chunk itself in bytes
23916-bit|count of lines covered in this chunk
24032-bit|count of bytes in the input file this chunk covers
241VLI...|for each line in the chunk, the number of bytes in the line
242
243
244### Filepaths
245
246The single trie in the file may contain information from multiple files, for
247example one trie may cover all files in a directory. The "Filepaths" are
248listed after the line tables, and referred to by index thereafter.
249
250For each filepath, one after the other:
251
252size|function
253---|---
254VLI|fileoffset of the start of this filepath's line table
255VLI|count of lines in the file
256VLI|length of filepath in bytes
257...|the filepath (with no NUL)
258
259### Filepath map
260
261To facilitate rapid filepath lookup, there's a filepath map table with a 32-bit
262fileoffset per filepath. This is the way to convert filepath indexes to
263information on the filepath like its name, etc
264
265size|function
266---|---
26732-bit...|fileoffset to filepath table for each filepath
268
269### Trie entries
270
271Immediately after that, the trie entries are dumped, for each one a header:
272
273#### Trie entry header
274
275size|function
276---|---
277VLI|Fileoffset of first file table in this trie entry instance list
278VLI|number of child trie entries this trie entry has
279VLI|number of instances this trie entry has
280
281The child list follows immediately after this header
282
283#### Trie entry instance file
284
285For each file that has instances of this symbol:
286
287size|function
288---|---
289VLI|Fileoffset of next file table in this trie entry instance list
290VLI|filepath index
291VLI|count of line number instances following
292
293#### Trie entry file line number table
294
295Then for the file mentioned above, a list of all line numbers in the file with
296the symbol in them, in ascending order. As a VLI, the median size per entry
297will typically be ~15.9 bits due to the probability of line numbers below 16K.
298
299size|function
300---|---
301VLI|line number
302...
303
304#### Trie entry child table
305
306For each child node
307
308size|function
309---|---
310VLI|file offset of child
311VLI|instance count belonging directly to this child
312VLI|aggregated number of instances down all descendent paths of child
313VLI|aggregated number of children down all descendent paths of child
314VLI|match string length
315...|the match string
316