Name |
Date |
Size |
#Lines |
LOC |
||
---|---|---|---|---|---|---|
.. | - | - | ||||
README.md | D | 12-May-2024 | 11 KiB | 316 | 228 | |
private-lib-misc-fts.h | D | 12-May-2024 | 404 | 24 | 14 | |
trie-fd.c | D | 12-May-2024 | 21.4 KiB | 1,005 | 643 | |
trie.c | D | 12-May-2024 | 31.6 KiB | 1,373 | 813 |
README.md
1# LWS Full Text Search 2 3## Introduction 4 5 6 7The general approach is to scan one or more UTF-8 input text "files" (they may 8only exist in memory) and create an in-memory optimized trie for every token in 9the file. 10 11This can then be serialized out to disk in the form of a single index file (no 12matter how many input files were involved or how large they were). 13 14The implementation is designed to be modest on memory and cpu for both index 15creation and querying, and suitable for weak machines with some kind of random 16access storage. For searching only memory to hold results is required, the 17actual searches and autocomplete suggestions are done very rapidly by seeking 18around structures in the on-disk index file. 19 20Function|Related Link 21---|--- 22Public API|[include/libwebsockets/lws-fts.h](https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-fts.h) 23CI test app|[minimal-examples/api-tests/api-test-fts](https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/api-tests/api-test-fts) 24Demo minimal example|[minimal-examples/http-server/minimal-http-server-fulltext-search](https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/http-server/minimal-http-server-fulltext-search) 25Live Demo|[https://libwebsockets.org/ftsdemo/](https://libwebsockets.org/ftsdemo/) 26 27## Query API overview 28 29Searching returns a potentially very large lwsac allocated object, with contents 30and max size controlled by the members of a struct lws_fts_search_params passed 31to the search function. Three kinds of result are possible: 32 33### Autocomplete suggestions 34 35These are useful to provide lists of extant results in 36realtime as the user types characters that constrain the search. So if the 37user has typed 'len', any hits for 'len' itself are reported along with 38'length', and whatever else is in the index beginning 'len'.. The results are 39selected using and are accompanied by an aggregated count of results down that 40path, and the results so the "most likely" results already measured by potential 41hits appear first. 42 43These results are in a linked-list headed by `result.autocomplete_head` and 44each is in a `struct lws_fts_result_autocomplete`. 45 46They're enabled in the search results by giving the flag 47 `LWSFTS_F_QUERY_AUTOCOMPLETE` in the search parameter flags. 48 49### Filepath results 50 51Simply a list of input files containing the search term with some statistics, 52one file is mentioned in a `struct lws_fts_result_filepath` result struct. 53 54This would be useful for creating a selection UI to "drill down" to individual 55files when there are many with matches. 56 57This is enabled by the `LWSFTS_F_QUERY_FILES` search flag. 58 59### Filepath and line results 60 61Same as the file path list, but for each filepath, information on the line 62numbers and input file offset where the line starts are provided. 63 64This is enabled by `LWSFTS_F_QUERY_FILE_LINES`... if you additionally give 65`LWSFTS_F_QUERY_QUOTE_LINE` flag then the contents of each hit line from the 66input file are also provided. 67 68## Result format inside the lwsac 69 70A `struct lws_fts_result` at the start of the lwsac contains heads for linked- 71lists of autocomplete and filepath results inside the lwsac. 72 73For autocomplete suggestions, the string itself is immediately after the 74`struct lws_fts_result_autocomplete` in memory. For filepath results, after 75each `struct lws_fts_result_filepath` is 76 77 - match information depending on the flags given to the search 78 - the filepath string 79 80You can always skip the line number table to get the filepath string by adding 81.matches_length to the address of the byte after the struct. 82 83The matches information is either 84 85 - 0 bytes per match 86 87 - 2x int32_t per match (8 bytes) if `LWSFTS_F_QUERY_FILE_LINES` given... the 88 first is the native-endian line number of the match, the second is the 89 byte offset in the original file where that line starts 90 91 - 2 x int32_t as above plus a const char * if `LWSFTS_F_QUERY_QUOTE_LINE` is 92 also given... this points to a NUL terminated string also stored in the 93 results lwsac that contains up to 255 chars of the line from the original 94 file. In some cases, the original file was either virtual (you are indexing 95 a git revision) or is not stored with the index, in that case you can't 96 usefully use `LWSFTS_F_QUERY_QUOTE_LINE`. 97 98To facilitate interpreting what is stored per match, the original search flags 99that created the result are stored in the `struct lws_fts_result`. 100 101## Indexing In-memory and serialized to file 102 103When creating the trie, in-memory structs are used with various optimization 104schemes trading off memory usage for speed. While in-memory, it's possible to 105add more indexed filepaths to the single index. Once the trie is complete in 106terms of having indexed everything, it is serialized to disk. 107 108These contain many additional housekeeping pointers and trie entries which can 109be optimized out. Most in-memory values must be held literally in large types, 110whereas most of the values in the serialized file use smaller VLI which use 111more or less bytes according to the value. So the peak memory requirements for 112large tries are much bigger than the size of the serialized trie file that is 113output. 114 115For the linux kernel at 4.14 and default indexing list on a 2.8GHz AMD 116threadripper (using one thread), the stats are: 117 118Name|Value 119---|--- 120Files indexed|52932 121Input corpus size|694MiB 122Indexing cpu time|50.1s (>1000 files / sec; 13.8MBytes/sec) 123Peak alloc|78MiB 124Serialization time|202ms 125Trie File size|347MiB 126 127To index libwebsockets main branch under the same conditions: 128 129Name|Value 130---|--- 131Files indexed|489 132Input corpus size|3MiB 133Indexing time|123ms 134Peak alloc|3MiB 135Serialization time|1ms 136Trie File size|1.4MiB 137 138 139Once it's generated, querying the trie file is very inexpensive, even when there 140are lots of results. 141 142 - trie entry child lists are kept sorted by the character they map to. This 143 allows discovering there is no match as soon as a character later in the 144 order than the one being matched is seen 145 146 - for the root trie, in addition to the linked-list child + sibling entries, 147 a 256-entry pointer table is associated with the root trie, allowing one- 148 step lookup. But as the table is 2KiB, it's too expensive to use on all 149 trie entries 150 151## Structure on disk 152 153All explicit multibyte numbers are stored in Network (MSB-first) byte order. 154 155 - file header 156 - filepath line number tables 157 - filepath information 158 - filepath map table 159 - tries, trie instances (hits), trie child tables 160 161### VLI coding 162 163VLI (Variable Length Integer) coding works like this 164 165[b7 EON] [b6 .. b0 DATA] 166 167If EON = 0, then DATA represents the Least-significant 7 bits of the number. 168if EON = 1, DATA represents More-significant 7-bits that should be shifted 169left until the byte with EON = 0 is found to terminate the number. 170 171The VLI used is predicated around 32-bit unsigned integers 172 173Examples: 174 175 - 0x30 = 48 176 - 0x81 30 = 176 177 - 0x81 0x80 0x00 = 16384 178 179Bytes | Range 180---|--- 1811|<= 127 1822|<= 16K - 1 1833|<= 2M -1 1844|<= 256M - 1 1855|<= 4G - 1 186 187The coding is very efficient if there's a high probabilty the number being 188stored is not large. So it's great for line numbers for example, where most 189files have less that 16K lines and the VLI for the line number fits in 2 bytes, 190but if you meet a huge file, the VLI coding can also handle it. 191 192All numbers except a few in the headers that are actually written after the 193following data are stored using VLI for space- efficiency without limiting 194capability. The numbers that are fixed up after the fact have to have a fixed 195size and can't use VLI. 196 197### File header 198 199The first byte of the file header where the magic is, is "fileoffset" 0. All 200the stored "fileoffset"s are relative to that. 201 202The header has a fixed size of 16 bytes. 203 204size|function 205---|--- 20632-bits|Magic 0xCA7A5F75 20732-bits|Fileoffset to root trie entry 20832-bits|Size of the trie file when it was created (to detect truncation) 20932-bits|Fileoffset to the filepath map 21032-bits|Number of filepaths 211 212### Filepath line tables 213 214Immediately after the file header are the line length tables. 215 216As the input files are parsed, line length tables are written for each file... 217at that time the rest of the parser data is held in memory so nothing else is 218in the file yet. These allow you to map logical line numbers in the file to 219file offsets space- and time- efficiently without having to walk through the 220file contents. 221 222The line information is cut into blocks, allowing quick skipping over the VLI 223data that doesn't contain the line you want just by following the 8-byte header 224part. 225 226Once you find the block with your line, you have to iteratively add the VLIs 227until you hit the one you want. 228 229For normal text files with average line length below 128, the VLIs will 230typically be a single byte. So a block of 200 line lengths is typically 231208 bytes long. 232 233There is a final linetable chunk consisting of all zeros to indicate the end 234of the filepath line chunk series for a filepath. 235 236size|function 237---|--- 23816-bit|length of this chunk itself in bytes 23916-bit|count of lines covered in this chunk 24032-bit|count of bytes in the input file this chunk covers 241VLI...|for each line in the chunk, the number of bytes in the line 242 243 244### Filepaths 245 246The single trie in the file may contain information from multiple files, for 247example one trie may cover all files in a directory. The "Filepaths" are 248listed after the line tables, and referred to by index thereafter. 249 250For each filepath, one after the other: 251 252size|function 253---|--- 254VLI|fileoffset of the start of this filepath's line table 255VLI|count of lines in the file 256VLI|length of filepath in bytes 257...|the filepath (with no NUL) 258 259### Filepath map 260 261To facilitate rapid filepath lookup, there's a filepath map table with a 32-bit 262fileoffset per filepath. This is the way to convert filepath indexes to 263information on the filepath like its name, etc 264 265size|function 266---|--- 26732-bit...|fileoffset to filepath table for each filepath 268 269### Trie entries 270 271Immediately after that, the trie entries are dumped, for each one a header: 272 273#### Trie entry header 274 275size|function 276---|--- 277VLI|Fileoffset of first file table in this trie entry instance list 278VLI|number of child trie entries this trie entry has 279VLI|number of instances this trie entry has 280 281The child list follows immediately after this header 282 283#### Trie entry instance file 284 285For each file that has instances of this symbol: 286 287size|function 288---|--- 289VLI|Fileoffset of next file table in this trie entry instance list 290VLI|filepath index 291VLI|count of line number instances following 292 293#### Trie entry file line number table 294 295Then for the file mentioned above, a list of all line numbers in the file with 296the symbol in them, in ascending order. As a VLI, the median size per entry 297will typically be ~15.9 bits due to the probability of line numbers below 16K. 298 299size|function 300---|--- 301VLI|line number 302... 303 304#### Trie entry child table 305 306For each child node 307 308size|function 309---|--- 310VLI|file offset of child 311VLI|instance count belonging directly to this child 312VLI|aggregated number of instances down all descendent paths of child 313VLI|aggregated number of children down all descendent paths of child 314VLI|match string length 315...|the match string 316