.. _module-pw_tokenizer-token-databases:

===============
Token databases
===============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
===================
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5,          ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a,          ,"Hello %s! %hd %e"
   851beeb6,          ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"

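Because the format uses standard CSV quoting, Python's built-in ``csv`` module
can read it directly. The following is a minimal sketch, not part of the
pw_tokenizer API, that parses two rows of the example database above:

.. code-block:: python

   import csv
   import io

   # Two rows from the example database above, held in memory for illustration.
   EXAMPLE_CSV = (
       '141c35d5,          ,"The answer: ""%s"""\n'
       '2e668cd6,2019-12-25,"Jello, world!"\n'
   )

   database = {}
   for token, removal_date, string in csv.reader(io.StringIO(EXAMPLE_CSV)):
       # The removal date column is blank-padded when there is no date.
       database[int(token, 16)] = (removal_date.strip() or None, string)

   print(database[0x141C35D5])  # (None, 'The answer: "%s"')

Note that ``csv.reader`` handles the doubled quote characters automatically, so
the stored strings come back exactly as written in the source code.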
Binary database format
======================
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.

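As the dump suggests, each entry is two little-endian 32-bit words: the token,
then the removal date packed with the year in the upper 16 bits and the month
and day in the lower two bytes. The sketch below decodes one entry; the layout
is inferred from the dump above, and ``token_database.h`` remains the
authoritative reference:

.. code-block:: python

   import struct

   def parse_entry(entry: bytes):
       """Decodes one 8-byte entry: a token and an optional removal date."""
       token, date = struct.unpack('<II', entry)  # Two little-endian uint32s.
       if date == 0xFFFFFFFF:  # Sentinel meaning "no removal date".
           return token, None
       # Year in the upper 16 bits, then one byte each for month and day.
       return token, (date >> 16, (date >> 8) & 0xFF, date & 0xFF)

   # The entry at offset 0x18 in the dump: token 2e668cd6, removed 2019-12-25.
   token, date = parse_entry(bytes.fromhex('d68c662e190ce307'))

For example, the date word ``07e30c19`` unpacks to year 0x07e3 (2019), month
0x0c (12), and day 0x19 (25), matching the CSV database's ``2019-12-25`` entry.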
.. _module-pw_tokenizer-directory-database-format:

Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database
is searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all of
which are used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   directory_token_database
   ├── database.pw_tokenizer.csv
   ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
   └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` commands in the database CLI consolidates the files into a single
CSV.

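The recursive lookup amounts to a glob over the directory. A minimal sketch of
the equivalent search, assuming only the file-suffix convention documented
above (the helper name is hypothetical, not a pw_tokenizer API):

.. code-block:: python

   from pathlib import Path

   def find_database_fragments(root: str) -> list:
       """Recursively finds the CSV fragments of a directory database."""
       return sorted(Path(root).rglob('*.pw_tokenizer.csv'))

Each file found this way is an ordinary CSV database and can be parsed as
described in the CSV database format section.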
The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``, and tokens present in that CSV that are not in
the newly added tokens are discarded. This helps keep temporary tokens (e.g.
from debug logs) out of the database.

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON-formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.

.. _module-pw_tokenizer-managing-token-databases:


------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control or for
human review. Binary databases are more compact and simpler to parse. Currently,
the C++ detokenizer library only supports binary databases.

.. _module-pw_tokenizer-update-token-database:

Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens already present in the database updates their removal dates, if
any, to the latest.

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
==============
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps``, or build other
   GN targets first, if this is a concern.

CMake integration
=================
Token databases may be updated or created as part of a CMake build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from a CMake target.

To create a new database, set the ``CREATE`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory.

.. code-block::

   include("$dir_pw_tokenizer/database.cmake")

   pw_tokenizer_database("my_database") {
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   }

To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   pw_tokenizer_database("my_database") {
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   }

.. _module-pw_tokenizer-collisions:

----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

- whether the tokenized data matches the string's arguments (if any), and
- if and when the string was marked as having been removed from the database.

Resolving collisions
====================
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: sh

     python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this is
known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       |         50%        |          1%      |
+=======+====================+==================+
|   32  |       77000        |        9300      |
+-------+--------------------+------------------+
|   31  |       54000        |        6600      |
+-------+--------------------+------------------+
|   24  |        4800        |         580      |
+-------+--------------------+------------------+
|   16  |         300        |          36      |
+-------+--------------------+------------------+
|    8  |          19        |           3      |
+-------+--------------------+------------------+

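These counts follow from the standard birthday-problem approximation
``p = 1 - exp(-n * (n - 1) / 2**(b + 1))`` for ``n`` strings and ``b`` token
bits. A quick sketch, independent of pw_tokenizer, for checking any other
combination:

.. code-block:: python

   import math

   def collision_probability(num_strings: int, token_bits: int) -> float:
       """Approximate probability of at least one collision (birthday problem)."""
       n = num_strings
       return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2 ** token_bits))

   # Roughly 50% odds of a collision with 77000 strings and 32-bit tokens.
   print(round(collision_probability(77000, 32), 2))  # 0.5

Evaluating this function for the string counts in the table reproduces the 1%
and 50% probabilities shown above.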
Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when
tokenizing a small set of strings, such as module names, but won't be suitable
for large sets of strings, like log messages.