.. _module-pw_tokenizer-token-databases:

===============
Token databases
===============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files,
so that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
===================
The CSV database format has four columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, the token domain, and the string
literal. The domain and string are quoted, and quote characters within the
domain or string are represented as two quote characters.

This example database contains seven strings, three of which have removal
dates.

.. code-block::

   141c35d5,          ,"","The answer: ""%s"""
   2e668cd6,2019-12-25,"","Jello, world!"
   7a22c974,          ,"metrics","%f"
   7b940e2a,          ,"","Hello %s! %hd %e"
   851beeb6,          ,"","%u %d"
   881436a0,2020-01-01,"","The answer is: %s"
   e13b0f94,2020-04-01,"metrics","%llu"

Legacy CSV databases did not include the domain, so they had only three
columns. These databases are still supported, but their tokens are always in
the default domain (``""``).

Binary database format
======================
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none.
The string literals are stored next, in the same order as the entries. Strings
are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00           is: %s.%llu.

.. _module-pw_tokenizer-directory-database-format:

Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database is
searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all of
which are used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   directory_token_database
   ├── database.pw_tokenizer.csv
   ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
   └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source code.
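
The recursive search described above can be sketched in a few lines of Python.
This is an illustration only, not the pw_tokenizer implementation, and the
function name is invented for the example.

.. code-block:: python

   from pathlib import Path

   def collect_directory_database(root: str) -> list[Path]:
       """Recursively finds the CSV files making up a directory database."""
       # Any file ending in .pw_tokenizer.csv, at any depth, is part of the
       # database. Sorting gives a deterministic load order.
       return sorted(Path(root).rglob("*.pw_tokenizer.csv"))

Every returned file would then be loaded for detokenization lookups.
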
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` commands in the database CLI consolidates the files into a single
CSV.

The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``, and any tokens found in that CSV that are not
among the newly added tokens are discarded. This helps keep temporary tokens
(e.g. from debug logs) out of the database.

ELF section database format
===========================
During compilation, pw_tokenizer stores its entries in an ELF section. Entries
are stored as a header followed by the domain and string. The format for these
entries is described below.

.. literalinclude:: public/pw_tokenizer/internal/tokenize_string.h
   :start-after: [pw_tokenizer-elf-entry]
   :end-before: [pw_tokenizer-elf-entry]
   :language: C

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON-formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.

.. _module-pw_tokenizer-managing-token-databases:

------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.
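
When experimenting with databases, it can be handy to inspect a CSV database
directly from Python. The reader below is a minimal sketch of the four-column
format described earlier; it is not part of the pw_tokenizer API, and the
function name is invented for this example.

.. code-block:: python

   import csv

   def read_csv_database(path: str) -> dict[tuple[str, int], str]:
       """Maps (domain, token) pairs to strings for a four-column database."""
       entries = {}
       with open(path, newline="") as file:
           for token, _removal_date, domain, string in csv.reader(file):
               # Tokens are hexadecimal; the removal date column may be blank
               # or padded with spaces when the string has not been removed.
               entries[(domain, int(token, 16))] = string
       return entries

This sketch ignores removal dates and assumes the modern four-column format;
legacy three-column databases would need separate handling.
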

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o,
.so, etc.), archives (.a), existing token databases (CSV or binary), or a JSON
file containing an array of strings.

.. code-block:: console

   $ ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control and for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently supports only binary databases.

.. _module-pw_tokenizer-update-token-database:

Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: console

   $ ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens already present in the database updates their removal date, if
any, to the latest.

A CSV token database can be checked into a source repository and updated as
code changes are made. The build system can invoke ``database.py`` to update
the database after each build.

GN integration
==============
Token databases may be updated or created as part of a GN build.
The 182``pw_tokenizer_database`` template provided by 183``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized 184strings database or creates a new database with artifacts from one or more GN 185targets or other database files. 186 187To create a new database, set the ``create`` variable to the desired database 188type (``"csv"`` or ``"binary"``). The database will be created in the output 189directory. To update an existing database, provide the path to the database with 190the ``database`` variable. 191 192.. code-block:: 193 194 import("//build_overrides/pigweed.gni") 195 196 import("$dir_pw_tokenizer/database.gni") 197 198 pw_tokenizer_database("my_database") { 199 database = "database_in_the_source_tree.csv" 200 targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ] 201 input_databases = [ "other_database.csv" ] 202 } 203 204Instead of specifying GN targets, paths or globs to output files may be provided 205with the ``paths`` option. 206 207.. code-block:: 208 209 pw_tokenizer_database("my_database") { 210 database = "database_in_the_source_tree.csv" 211 deps = [ ":apps" ] 212 optional_paths = [ "$root_build_dir/**/*.elf" ] 213 } 214 215.. note:: 216 217 The ``paths`` and ``optional_targets`` arguments do not add anything to 218 ``deps``, so there is no guarantee that the referenced artifacts will exist 219 when the database is updated. Provide ``targets`` or ``deps`` or build other 220 GN targets first if this is a concern. 221 222CMake integration 223================= 224Token databases may be updated or created as part of a CMake build. The 225``pw_tokenizer_database`` template provided by 226``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source tokenized 227strings database or creates a new database with artifacts from a CMake target. 228 229To create a new database, set the ``CREATE`` variable to the desired database 230type (``"csv"`` or ``"binary"``). 
The database will be created in the output
directory.

.. code-block::

   include("$dir_pw_tokenizer/database.cmake")

   pw_tokenizer_database("my_database") {
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   }

To update an existing database, provide the path to the database with the
``DATABASE`` variable.

.. code-block::

   pw_tokenizer_database("my_database") {
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   }

.. _module-pw_tokenizer-collisions:

----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically.
Collisions are resolved based on two things:

- whether the tokenized data matches the string's arguments (if any), and
- if and when the string was marked as having been removed from the database.

Resolving collisions
====================
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about
a token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed.
  This deprioritizes them in
  collision resolution.

  .. code-block:: console

     $ python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this
is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have
a 1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when tokenizing
a small set of strings, such as module names, but won't be suitable for large
sets of strings, like log messages.
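
The values above follow from the usual birthday-problem approximation: with
``d = 2**token_bits`` possible tokens, roughly
``n = sqrt(2 * d * ln(1 / (1 - p)))`` strings give a probability ``p`` of at
least one collision. The short Python sketch below (illustrative only, with a
made-up function name) reproduces the larger rows of the table.

.. code-block:: python

   import math

   def strings_for_collision_probability(token_bits: int, p: float) -> float:
       """Approximates how many strings give probability p of a collision.

       Uses the birthday-problem approximation, which is accurate for large
       token spaces; the smallest rows of the table (e.g. 8-bit tokens)
       require exact counting instead.
       """
       d = 2 ** token_bits
       return math.sqrt(2 * d * math.log(1 / (1 - p)))

For 32-bit tokens this gives roughly 77,000 strings for a 50% chance of a
collision and roughly 9,300 for a 1% chance, matching the first row of the
table.
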