.. _module-pw_tokenizer-token-databases:

===============
Token databases
===============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
===================
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5, ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a, ,"Hello %s! %hd %e"
   851beeb6, ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"

Binary database format
======================
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next, in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e TOKENS..
   0x08: 00000006 00000000 ........

   [entries]
   0x10: 141c35d5 ffffffff .5......
   0x18: 2e668cd6 07e30c19 ..f.....
   0x20: 7b940e2a ffffffff *..{....
   0x28: 851beeb6 ffffffff ........
   0x30: 881436a0 07e40101 .6......
   0x38: e13b0f94 07e40401 ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00          is: %s.%llu.
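To make the layout concrete, here is a minimal Python sketch that decodes a
database in this format. It is an illustration rather than the supported
parser (use the pw_tokenizer tooling for real work), and the removal date
packing shown is inferred from the example dump above:

.. code-block:: python

   import struct

   def read_binary_database(data: bytes):
       """Yields (token, removal_date, string) tuples; a sketch, not the
       official parser. See token_database.h for the authoritative format."""
       # 16-byte header: 8-byte magic ('TOKENS' plus two bytes, zero in the
       # example above), 4-byte little-endian entry count, 4 reserved bytes.
       magic, entry_count = struct.unpack_from('<8sI4x', data, 0)
       assert magic.startswith(b'TOKENS'), 'Not a binary token database'

       # 8-byte entries: little-endian token, then the removal date. The dump
       # above suggests dates pack as (year << 16) | (month << 8) | day, with
       # 0xFFFFFFFF meaning no removal date.
       entries = []
       offset = 16
       for _ in range(entry_count):
           token, date = struct.unpack_from('<II', data, offset)
           offset += 8
           if date == 0xFFFFFFFF:
               removal_date = None
           else:
               removal_date = (
                   f'{date >> 16:04}-{(date >> 8) & 0xFF:02}-{date & 0xFF:02}')
           entries.append((token, removal_date))

       # The string table holds one null-terminated string per entry, in order.
       strings = data[offset:].split(b'\x00')
       for (token, removal_date), string in zip(entries, strings):
           yield token, removal_date, string.decode()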
.. _module-pw_tokenizer-directory-database-format:

Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database is
searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all of
which are used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   directory_token_database
   ├── database.pw_tokenizer.csv
   ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
   └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running ``mark_removed`` or ``purge``
commands in the database CLI consolidates the files to a single CSV.

The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``, and discards any tokens in that CSV that are
not among the newly added tokens. This helps keep temporary tokens (e.g. from
debug logs) out of the database.

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON-formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.

.. _module-pw_tokenizer-managing-token-databases:

------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control and for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently only supports binary databases.
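As an example of the JSON input mentioned above, a database could be created
from a hypothetical ``strings.json`` file containing an array of strings (the
file name is illustrative):

.. code-block:: json

   [
     "The answer is: %s",
     "Hello %s!"
   ]

.. code-block:: sh

   ./database.py create --database DATABASE_NAME strings.json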
.. _module-pw_tokenizer-update-token-database:

Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens that are already present in the database updates their removal
dates, if any, to the latest.

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build, as sketched below.
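For example, a build could run the ``add`` command from a small post-build
script. A minimal sketch, assuming an illustrative ``tokens.csv`` database and
``out/app.elf`` artifact:

.. code-block:: python

   # Hypothetical post-build hook that refreshes a checked-in CSV database.
   import subprocess
   import sys

   def update_token_database(database: str, *artifacts: str) -> None:
       # Equivalent to: ./database.py add --database <database> <artifacts>
       subprocess.run(
           [sys.executable, '-m', 'pw_tokenizer.database', 'add',
            '--database', database, *artifacts],
           check=True,
       )

   update_token_database('tokens.csv', 'out/app.elf')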
GN integration
==============
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_paths`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps``, or build
   other GN targets first, if this is a concern.

CMake integration
=================
Token databases may be updated or created as part of a CMake build. The
``pw_tokenizer_database`` function provided by
``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from a CMake target.

To create a new database, set the ``CREATE`` argument to the desired database
type (``csv`` or ``binary``). The database will be created in the output
directory.

.. code-block:: cmake

   include("$dir_pw_tokenizer/database.cmake")

   pw_tokenizer_database(my_database
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   )

To update an existing database, provide the path to the database with the
``DATABASE`` argument.

.. code-block:: cmake

   pw_tokenizer_database(my_database
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   )

.. _module-pw_tokenizer-collisions:

----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

- whether the tokenized data match the string's arguments (if any), and
- if / when the string was marked as having been removed from the database.

Resolving collisions
====================
Collisions occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: sh

     python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when
tokenizing a small set of strings, such as module names, but won't be suitable
for large sets of strings, like log messages.
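For reference, the table above follows from the standard birthday-problem
approximation ``n ≈ sqrt(2 * d * ln(1 / (1 - p)))``, where ``d`` is the number
of possible token values and ``p`` is the collision probability. A quick
sketch (the approximation becomes rough for very small token spaces, such as
8 bits):

.. code-block:: python

   import math

   def strings_for_probability(token_bits: int, p: float) -> float:
       """Approximate string count giving collision probability p."""
       d = 2 ** token_bits  # Number of possible token values.
       return math.sqrt(2 * d * math.log(1 / (1 - p)))

   for bits in (32, 31, 24, 16, 8):
       fifty = strings_for_probability(bits, 0.50)
       one = strings_for_probability(bits, 0.01)
       print(f'{bits:2} bits: 50% at ~{fifty:.0f} strings, 1% at ~{one:.0f}')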