.. _module-pw_tokenizer-token-databases:

===============
Token databases
===============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
===================
The CSV database format has four columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, the token domain, and the string
literal. The domain and string are quoted, and quote characters within the
domain or string are represented as two quote characters.

This example database contains seven strings, three of which have removal dates.

.. code-block::

   141c35d5,          ,"","The answer: ""%s"""
   2e668cd6,2019-12-25,"","Jello, world!"
   7a22c974,          ,"metrics","%f"
   7b940e2a,          ,"","Hello %s! %hd %e"
   851beeb6,          ,"","%u %d"
   881436a0,2020-01-01,"","The answer is: %s"
   e13b0f94,2020-04-01,"metrics","%llu"

Legacy CSV databases did not include the domain column, so they had only three
columns. These databases are still supported, but their tokens are always in
the default domain (``""``).

Binary database format
======================
The binary database format consists of a 16-byte header followed by a series of
8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next, in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
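
As an illustration of how easily processed the format is, this self-contained
Python sketch decodes the layout shown above. It assumes the 8-byte ``TOKENS``
magic and the year/month/day date packing visible in the dump;
``token_database.h`` (linked above) is authoritative.

.. code-block:: python

   import struct

   def parse_binary_database(data: bytes):
       """Parses the binary token database layout shown above (sketch only)."""
       magic, entry_count = struct.unpack_from('<8sI', data, 0)
       assert magic == b'TOKENS\x00\x00'
       # A 16-byte header, then entry_count entries of two uint32s each:
       # the token and the removal date (0xffffffff if none).
       entries = []
       offset = 16
       for _ in range(entry_count):
           token, date = struct.unpack_from('<II', data, offset)
           offset += 8
           if date == 0xFFFFFFFF:
               removal_date = None
           else:  # Packed as 0xYYYYMMDD; e.g. 0x07e30c19 is 2019-12-25.
               removal_date = (date >> 16, (date >> 8) & 0xFF, date & 0xFF)
           entries.append((token, removal_date))
       # Null-terminated strings follow, in the same order as the entries.
       strings = data[offset:].split(b'\x00')[:entry_count]
       return [(token, date, string.decode()) for (token, date), string
               in zip(entries, strings)]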

.. _module-pw_tokenizer-directory-database-format:

Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database is
searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all of which
are used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   directory_token_database
   ├── database.pw_tokenizer.csv
   ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
   └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` commands in the database CLI consolidates the files to a single CSV.

The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>`` and discards any tokens in that CSV that are not
among the newly added tokens. This helps keep temporary tokens (e.g., from
debug logs) out of the database.
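
For example, a hypothetical invocation (the database directory, commit, and ELF
path are all placeholders) might look like this:

.. code-block:: console

   $ ./database.py add --database directory_token_database \
         --discard-temporary origin/main out/app.elf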

ELF section database format
===========================
During compilation, pw_tokenizer stores its entries in an ELF section. Entries
are stored as a header followed by the domain and string. The format for these
entries is described below.

.. literalinclude:: public/pw_tokenizer/internal/tokenize_string.h
   :start-after: [pw_tokenizer-elf-entry]
   :end-before: [pw_tokenizer-elf-entry]
   :language: C

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON-formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.
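
For example, the JSON file is simply an array of string literals, which can be
passed to the ``create`` command (the file names here are hypothetical):

.. code-block:: console

   $ cat example_strings.json
   ["The answer is: %s", "Jello, world!"]
   $ ./database.py create --database tokens.csv example_strings.json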

.. _module-pw_tokenizer-managing-token-databases:

------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.
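
Since ELF files can be read as token databases, one way to start is to run the
``report`` command (described later on this page) against that example file:

.. code-block:: console

   $ ./database.py report pw_tokenizer/py/example_binary_with_tokenized_strings.elf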

.. _module-pw_tokenizer-database-creation:

Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: console

   $ ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are good for checking into source control and for
human review. Binary databases are more compact and simpler to parse. Currently,
the C++ detokenizer library only supports binary databases.
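
For example, to write a binary database instead (the output name is arbitrary):

.. code-block:: console

   $ ./database.py create --type binary --database tokens.bin ELF_OR_DATABASE_FILE...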

.. _module-pw_tokenizer-update-token-database:

Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: console

   $ ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens already present in the database updates the date removed, if any,
to the latest.

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
==============
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_paths`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps``, or build other
   GN targets first, if this is a concern.

CMake integration
=================
Token databases may be updated or created as part of a CMake build. The
``pw_tokenizer_database`` function provided by
``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from a CMake target.

To create a new database, set the ``CREATE`` argument to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory.

.. code-block::

   include("$dir_pw_tokenizer/database.cmake")

   pw_tokenizer_database("my_database"
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   )

To update an existing database, provide the path to the database with the
``DATABASE`` argument.

.. code-block::

   pw_tokenizer_database("my_database"
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   )

.. _module-pw_tokenizer-collisions:

----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two factors:

- whether the tokenized data matches the string's arguments (if any), and
- if and when the string was marked as removed from the database.

Resolving collisions
====================
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: console

     $ python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this
is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
50% or 1% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       |         50%        |          1%      |
+=======+====================+==================+
|   32  |       77000        |        9300      |
+-------+--------------------+------------------+
|   31  |       54000        |        6600      |
+-------+--------------------+------------------+
|   24  |        4800        |         580      |
+-------+--------------------+------------------+
|   16  |         300        |          36      |
+-------+--------------------+------------------+
|    8  |          19        |           3      |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when
tokenizing a small set of strings, such as module names, but won't be suitable
for large sets of strings, like log messages.
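
These figures follow from the standard birthday problem approximation
``p ~= 1 - exp(-n^2 / (2 * d))``, where ``n`` is the number of strings hashed
and ``d = 2**bits`` is the number of possible tokens. This self-contained
Python snippet approximately reproduces the table:

.. code-block:: python

   import math

   def strings_for_collision_probability(token_bits, probability):
       """Approximate string count for a target collision probability.

       Inverts the birthday problem approximation
       p = 1 - exp(-n^2 / (2 * d)), with d = 2**token_bits, assuming a
       uniform random hash.
       """
       d = 2 ** token_bits
       return round(math.sqrt(-2 * d * math.log(1 - probability)))

   for bits in (32, 31, 24, 16, 8):
       print(bits, strings_for_collision_probability(bits, 0.50),
             strings_for_collision_probability(bits, 0.01))
   # Prints values close to the table; the approximation drifts slightly
   # for very small counts (e.g. the exact 8-bit / 1% answer is 3, not 2).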
325