.. _module-pw_tokenizer:

============
pw_tokenizer
============
:bdg-primary:`host`
:bdg-primary:`device`
:bdg-secondary:`Python`
:bdg-secondary:`C++`
:bdg-secondary:`TypeScript`
:bdg-success:`stable`

Logging is critical, but developers are often forced to choose between
additional logging and saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

* Dramatically reduce binary size by removing string literals from binaries.
* Reduce I/O traffic, RAM, and flash usage by sending and storing compact tokens
  instead of strings. We've seen over 50% reduction in encoded log contents.
* Reduce CPU usage by replacing snprintf calls with simple tokenization code.
* Remove potentially sensitive log, assert, and other strings from binaries.

--------------
Basic overview
--------------
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

* **Tokenization** converts string literals in the source code to binary tokens
  at compile time. If the string has printf-style arguments, these are encoded
  to compact binary form at runtime.
* **Detokenization** converts tokenized strings back to the original
  human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

1. During compilation, the ``pw_tokenizer`` module hashes string literals to
   generate stable 32-bit tokens.
2. The tokenization macro removes these strings by declaring them in an ELF
   section that is excluded from the final binary.
3. After compilation, strings are extracted from the ELF to build a database of
   tokenized strings for use by the detokenizer. The ELF file may also be used
   directly.
4. During operation, the device encodes the string token and its arguments, if
   any.
5. The encoded tokenized strings are sent off-device or stored.
6. Off-device, the detokenizer tools use the token database to decode the
   strings to human-readable form.

Example: tokenized logging
==========================
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).

**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ========== | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ---------- |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ========== |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

---------------
Getting started
---------------
Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
section describes one way ``pw_tokenizer`` might be integrated with a project.
These steps can be adapted as needed.

1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel are
   provided. For Make or other build systems, add the files specified in the
   BUILD.gn's ``pw_tokenizer`` target to the build.
2. Use the tokenization macros in your code. See `Tokenization`_.
3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
   linker script. In GN and CMake, this step is done automatically.
4. Compile your code to produce an ELF file.
5. Run ``database.py create`` on the ELF file to generate a CSV token
   database. See `Managing token databases`_.
6. Commit the token database to your repository. See notes in `Managing token
   databases`_.
7. Integrate a ``database.py add`` command to your build to automatically update
   the committed token database. In GN, use the ``pw_tokenizer_database``
   template to do this. See `Update a database`_.
8. Integrate ``detokenize.py`` or the C++ detokenization library with your tools
   to decode tokenized logs. See `Detokenization`_.

Using with Zephyr
=================
When building ``pw_tokenizer`` with Zephyr, three Kconfig options are currently
available:

* ``CONFIG_PIGWEED_TOKENIZER`` will automatically link ``pw_tokenizer`` as well
  as any dependencies.
* ``CONFIG_PIGWEED_TOKENIZER_BASE64`` will automatically link
  ``pw_tokenizer.base64`` as well as any dependencies.
* ``CONFIG_PIGWEED_DETOKENIZER`` will automatically link
  ``pw_tokenizer.decoder`` as well as any dependencies.
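
For example, a Zephyr application that only emits tokenized strings might set
just the first option in its ``prj.conf`` (shown for illustration):

.. code-block:: text

   CONFIG_PIGWEED_TOKENIZER=y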

Once enabled, the tokenizer headers can be included like any Zephyr headers:

.. code-block:: cpp

   #include <pw_tokenizer/tokenize.h>

.. note::
  Zephyr handles the additional linker sections via
  ``pw_tokenizer_linker_rules.ld`` which is added to the end of the linker file
  via a call to ``zephyr_linker_sources(SECTIONS ...)``.

------------
Tokenization
------------
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

.. doxygentypedef:: pw_tokenizer_Token

Tokenization macros
===================
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
-------------------------
``pw_tokenizer`` provides macros for tokenizing string literals with no
arguments.

.. doxygendefine:: PW_TOKENIZE_STRING
.. doxygendefine:: PW_TOKENIZE_STRING_DOMAIN
.. doxygendefine:: PW_TOKENIZE_STRING_MASK

The tokenization macros above cannot be used inside other expressions.

.. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable.
  :class: checkmark

  .. code:: cpp

    constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");

    void Function() {
      constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
    }

.. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression.
  :class: error

  .. code:: cpp

    void BadExample() {
      ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
    }

  Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead.

An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to constexpr variables or be used with
special function variables like ``__func__``.

.. doxygendefine:: PW_TOKENIZE_STRING_EXPR
.. doxygendefine:: PW_TOKENIZE_STRING_DOMAIN_EXPR
.. doxygendefine:: PW_TOKENIZE_STRING_MASK_EXPR

.. admonition:: When to use these macros

  Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string
  literals that do not need %-style arguments encoded.

.. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions.
  :class: checkmark

  .. code:: cpp

    void GoodExample() {
      ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
    }

.. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable.
  :class: error

  .. code:: cpp

    constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");

  Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr`` variable.

.. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`.
  :class: error

  .. code:: cpp

    void BadExample() {
      // This compiles, but __func__ will not be the outer function's name, and
      // there may be compiler warnings.
      constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
    }

  Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or similar macros.

.. _module-pw_tokenizer-custom-macro:

Tokenize a message with arguments in a custom macro
---------------------------------------------------
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use ``pw_tokenizer`` is to pass tokenized
data to a global handler function. A project's custom tokenization macro can
handle tokenized data in a function of its choosing.

``pw_tokenizer`` provides two low-level macros for projects to use
to create custom tokenization macros.

.. doxygendefine:: PW_TOKENIZE_FORMAT_STRING
.. doxygendefine:: PW_TOKENIZER_ARG_TYPES

The outputs of these macros are typically passed to an encoding function. That
function encodes the token, argument types, and argument data to a buffer using
helpers provided by ``pw_tokenizer/encode_args.h``.

.. doxygenfunction:: pw::tokenizer::EncodeArgs
.. doxygenclass:: pw::tokenizer::EncodedMessage
   :members:
.. doxygenfunction:: pw_tokenizer_EncodeArgs

Example
^^^^^^^
The following example implements a custom tokenization macro similar to
:ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(uint32_t metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)         \
     do {                                                                 \
       PW_TOKENIZE_FORMAT_STRING(                                         \
           PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(metadata,                                   \
                              _pw_tokenizer_token,                        \
                              PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)         \
                                  PW_COMMA_ARGS(__VA_ARGS__));            \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
:cpp:class:`pw::tokenizer::EncodedMessage` class or
:cpp:func:`pw::tokenizer::EncodeArgs` function from
``pw_tokenizer/encode_args.h``. The encoded message can then be transmitted or
stored as needed.

.. code-block:: cpp

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               pw::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const uint32_t metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage<> encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: Why use a custom macro

   - Optimal code size. Invoking a free function with the tokenized data results
     in the smallest possible call site.
   - Pass additional arguments, such as metadata, with the tokenized message.
   - Integrate ``pw_tokenizer`` with other systems.

Tokenize a message with arguments to a buffer
---------------------------------------------
.. doxygendefine:: PW_TOKENIZE_TO_BUFFER
.. doxygendefine:: PW_TOKENIZE_TO_BUFFER_DOMAIN
.. doxygendefine:: PW_TOKENIZE_TO_BUFFER_MASK

.. admonition:: Why use this macro

   - Encode a tokenized message for consumption within a function.
   - Encode a tokenized message into an existing buffer.

   Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a
   logging macro, because it will result in larger code size than passing the
   tokenized data to a function.

Binary logging with pw_tokenizer
================================
String tokenization can be used to convert plain text logs to a compact,
efficient binary format. See :ref:`module-pw_log_tokenized`.

Tokenizing function names
=========================
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
======================
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

This function requires that a string's token has already been calculated.
Typically these tokens are provided by a database, but they can be manually
created using the tokenizer hash.

.. autofunction:: pw_tokenizer.tokens.pw_tokenizer_65599_hash

This is particularly useful for offline token database generation in cases where
tokenized strings in a binary cannot be embedded as parsable pw_tokenizer
entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed
   to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching
   hash length limit. When creating an offline database, it's a good idea to
   generate tokens for both, and merge the databases.
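
Putting these together, a message can be encoded entirely in Python. This is a
minimal sketch; it assumes the format string was tokenized with the C++ hash
(no length limit):

.. code-block:: python

   from pw_tokenizer import encode, tokens

   FORMAT_STRING = 'Battery voltage: %d mV'

   # Compute the token, then encode it together with the argument.
   token = tokens.pw_tokenizer_65599_hash(FORMAT_STRING)
   payload = encode.encode_token_and_args(token, 3989)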

Encoding
========
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated.
  The remaining 7 bits encode the string length, with a maximum of 127 bytes.
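
As a concrete example, the battery message from the overview encodes as shown
in this Python sketch, which reproduces the rules above for illustration (it is
not the ``pw_tokenizer`` implementation):

.. code-block:: python

   import struct

   def zigzag(value: int) -> int:
       """Maps a signed integer to an unsigned one, as in Protocol Buffers."""
       return (value << 1) ^ (value >> 63)

   def varint(value: int) -> bytes:
       """Encodes an unsigned integer as a little-endian base-128 varint."""
       out = bytearray()
       while True:
           out.append((value & 0x7F) | (0x80 if value > 0x7F else 0))
           value >>= 7
           if not value:
               return bytes(out)

   # Token for "Battery state: %s; battery voltage: %d mV" (see the overview).
   encoded = struct.pack('<I', 0x8E4728D9)             # d9 28 47 8e
   encoded += bytes([len(b'CHARGING')]) + b'CHARGING'  # length byte + contents
   encoded += varint(zigzag(3989))                     # aa 3e

   assert encoded.hex(' ') == 'd9 28 47 8e 08 43 48 41 52 47 49 4e 47 aa 3e'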

.. TODO(hepler): insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as an
   integer instead of a string). See also `Tokenized strings as %s arguments`_.

Buffer sizing helper
--------------------
.. doxygenfunction:: pw::tokenizer::MinEncodingBufferSizeBytes

Encoding command line utility
-----------------------------
The ``pw_tokenizer.encode`` command line tool can be used to encode tokenized
strings.

.. code-block:: bash

  python -m pw_tokenizer.encode [-h] FORMAT_STRING [ARG ...]

Example:

.. code-block:: text

  $ python -m pw_tokenizer.encode "There's... %d many of %s!" 2 them
        Raw input: "There's... %d many of %s!" % (2, 'them')
  Formatted input: There's... 2 many of them!
            Token: 0xb6ef8b2d
          Encoded: b'-\x8b\xef\xb6\x04\x04them' (2d 8b ef b6 04 04 74 68 65 6d) [10 bytes]
  Prefixed Base64: $LYvvtgQEdGhlbQ==

See ``--help`` for full usage details.

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

In C++, hashing is done with a constexpr function instead of a macro. This
function works with any length of string and has a lower compilation time
impact than the C macros. For consistency, C++ tokenization uses the same hash
algorithm, but the calculated values will differ between C and C++ for strings
longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.
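
The hash is roughly as follows; this Python sketch is an illustration of the
algorithm, not the canonical implementation (use
``pw_tokenizer.tokens.pw_tokenizer_65599_hash`` for real tokens):

.. code-block:: python

   def hash_65599(string: str, hash_length=None) -> int:
       """Sketch of the modified x65599 hash used to generate tokens."""
       hash_value = len(string) % 2**32
       coefficient = 65599
       # Each character is multiplied by a successive power of 65599.
       for char in string[:hash_length]:
           hash_value = (hash_value + coefficient * ord(char)) % 2**32
           coefficient = (coefficient * 65599) % 2**32
       return hash_value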

.. _module-pw_tokenizer-domains:

Tokenization domains
====================
``pw_tokenizer`` supports having multiple tokenization domains. A domain is a
string label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

.. _module-pw_tokenizer-masks:

Smaller tokens with masking
===========================
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token is the token.
This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros.
For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
`token collisions`_.

Masked tokens without arguments may be encoded in fewer bytes. For example, the
16-bit token ``0x1234`` may be encoded as two little-endian bytes (``34 12``)
rather than four (``34 12 00 00``). The detokenizer tools zero-pad data smaller
than four bytes. Tokens with arguments must always be encoded as four bytes.

Token collisions
================
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

- whether the tokenized data matches the string's arguments (if any), and
- if / when the string was marked as having been removed from the database.

Working with collisions
-----------------------
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: sh

     python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
-------------------------
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this
is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       |         50%        |          1%      |
+=======+====================+==================+
|   32  |       77000        |        9300      |
+-------+--------------------+------------------+
|   31  |       54000        |        6600      |
+-------+--------------------+------------------+
|   24  |        4800        |         580      |
+-------+--------------------+------------------+
|   16  |         300        |          36      |
+-------+--------------------+------------------+
|    8  |          19        |           3      |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
such as module names, but won't be suitable for large sets of strings, like log
messages.
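
The table's values follow from the standard birthday-problem approximation,
rounded; this sketch reproduces them (the approximation is loose for very small
string counts):

.. code-block:: python

   import math

   def strings_for_collision_probability(token_bits: int, probability: float) -> float:
       """Approximates how many strings give the specified collision probability.

       Uses the birthday-problem approximation:
       probability ~= 1 - exp(-n**2 / (2 * 2**token_bits))
       """
       return math.sqrt(2 * 2**token_bits * math.log(1 / (1 - probability)))

   for bits in (32, 31, 24, 16, 8):
       print(bits,
             round(strings_for_collision_probability(bits, 0.50)),
             round(strings_for_collision_probability(bits, 0.01)))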

---------------
Token databases
---------------
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

Token database formats
======================
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5,          ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a,          ,"Hello %s! %hd %e"
   851beeb6,          ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"
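
Because this is ordinary CSV, the format is easy to consume with standard
tooling. A minimal parsing sketch for illustration (the ``database.py`` tool
described below is the supported interface):

.. code-block:: python

   import csv
   import datetime

   def read_csv_database(path):
       """Yields (token, removal date or None, format string) tuples."""
       with open(path, newline='') as db_file:
           for token, date, string in csv.reader(db_file):
               removal = (datetime.date.fromisoformat(date.strip())
                          if date.strip() else None)
               yield int(token, 16), removal, string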

Binary database format
----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
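
For illustration, the layout above can be decoded with a few lines of Python.
This sketch is inferred from the example dump; ``token_database.h`` is the
authoritative reference:

.. code-block:: python

   import struct

   def read_binary_database(data: bytes):
       """Yields (token, removal date or None, format string) tuples.

       The layout here is inferred from the example dump above.
       """
       magic, entry_count = struct.unpack_from('<6s2xI4x', data)  # 16-byte header
       assert magic == b'TOKENS'
       strings = data[16 + 8 * entry_count:].decode().split('\0')
       for index, string in zip(range(entry_count), strings):
           token, date = struct.unpack_from('<II', data, 16 + 8 * index)
           removal = (None if date == 0xFFFFFFFF else
                      (date >> 16, (date >> 8) & 0xFF, date & 0xFF))  # (Y, M, D)
           yield token, removal, string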

Directory database format
-------------------------
pw_tokenizer can consume directories of CSV databases. A directory database
will be searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all
of which will be used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   token_database
   ├── chuck_e_cheese.pw_tokenizer.csv
   ├── fungi_ble.pw_tokenizer.csv
   └── some_more
       └── arcade.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running ``mark_removed`` or ``purge``
commands in the database CLI consolidates the files to a single CSV.

The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``, and discards tokens present in that CSV that
are not among the newly added tokens. This helps keep temporary tokens (e.g.
from debug logs) out of the database.

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.
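
For example, a JSON file like the following could be passed to
``database.py create`` to produce a database containing these strings' tokens:

.. code-block:: json

   ["Hello, world!", "Battery voltage: %d mV"]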

Managing token databases
========================
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
-----------------
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control or for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently supports only binary databases.

Update a database
-----------------
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens already present in the database updates the date removed, if any,
to the latest.

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
--------------
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps`` or build other
   GN targets first if this is a concern.

--------------
Detokenization
--------------
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python, C++, and TypeScript
detokenization libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c,          ,"Initiating retrieval process for recovery object"
   2a5388e4,          ,"Determining optimal approach and coordinating vectors"
   3743540c,          ,"Recovery object retrieval failed with status %s"
   f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that wish
   to interleave tokenized messages with plain text, using Base64 is a
   worthwhile tradeoff.

Python
======
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization. The
class also supports token domains for the given database files in the
``<path>#<domain>`` format.
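
A sketch of how the class might be used (the paths and domain here are
hypothetical):

.. code-block:: python

   import pw_tokenizer

   # Reloads the databases whenever the backing files change; reads only tokens
   # in the "my_domain" domain from the ELF.
   detokenizer = pw_tokenizer.AutoUpdatingDetokenizer(
       'path/to/database.csv', 'other/path.elf#my_domain')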

For messages that are optionally tokenized and may be encoded as binary,
Base64, or plaintext UTF-8, use
:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to
determine the correct method to detokenize and always provide a printable
string. For more information on this feature, see
:ref:`module-pw_tokenizer-proto`.

C99 ``printf`` Compatibility Notes
----------------------------------
This implementation is designed to align with the
`C99 specification, section 7.19.6
<https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf>`_.
Notably, this specification is slightly different from what is implemented
in most compilers due to each compiler choosing to interpret undefined
behavior in slightly different ways. Treat the following description as the
source of truth.

This implementation supports:

- Overall Format: ``%[flags][width][.precision][length][specifier]``
- Flags (Zero or More)
   - ``-``: Left-justify within the given field width; Right justification is
     the default (see Width modifier).
   - ``+``: Forces the result to be preceded with a plus or minus sign (``+``
     or ``-``) even for positive numbers. By default, only negative numbers are
     preceded with a ``-`` sign.
   - (space): If no sign is going to be written, a blank space is inserted
     before the value.
   - ``#``: Specifies that an alternative print syntax should be used.
      - Used with ``o``, ``x`` or ``X`` specifiers the value is preceded with
        ``0``, ``0x`` or ``0X``, respectively, for values different from zero.
      - Used with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or ``G`` it
        forces the written output to contain a decimal point even if no more
        digits follow. By default, if no digits follow, no decimal point is
        written.
   - ``0``: Left-pads the number with zeroes (``0``) instead of spaces when
     padding is specified (see width sub-specifier).
- Width (Optional)
   - ``(number)``: Minimum number of characters to be printed. If the value to
     be printed is shorter than this number, the result is padded with blank
     spaces or ``0`` if the ``0`` flag is present. The value is not truncated
     even if the result is larger. If the value is negative and the ``0`` flag
     is present, the ``0``\s are padded after the ``-`` symbol.
   - ``*``: The width is not specified in the format string, but as an
     additional integer value argument preceding the argument that has to be
     formatted.
- Precision (Optional)
   - ``.(number)``
      - For ``d``, ``i``, ``o``, ``u``, ``x``, ``X``, specifies the minimum
        number of digits to be written. If the value to be written is shorter
        than this number, the result is padded with leading zeros. The value is
        not truncated even if the result is longer.

        - A precision of ``0`` means that no character is written for the value
          ``0``.

      - For ``a``, ``A``, ``e``, ``E``, ``f``, and ``F``, specifies the number
        of digits to be printed after the decimal point. By default, this is
        ``6``.

      - For ``g`` and ``G``, specifies the maximum number of significant digits
        to be printed.

      - For ``s``, specifies the maximum number of characters to be printed. By
        default all characters are printed until the ending null character is
        encountered.

      - If the period is specified without an explicit value for precision,
        ``0`` is assumed.
   - ``.*``: The precision is not specified in the format string, but as an
     additional integer value argument preceding the argument that has to be
     formatted.
- Length (Optional)
   - ``hh``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be a ``signed char`` or ``unsigned char``.
     However, this is largely ignored in the implementation due to it not being
     necessary for Python or argument decoding (since the argument is always
     encoded at least as a 32-bit integer).
   - ``h``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be a ``signed short int`` or
     ``unsigned short int``. However, this is largely ignored in the
     implementation due to it not being necessary for Python or argument
     decoding (since the argument is always encoded at least as a 32-bit
     integer).
   - ``l``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be a ``signed long int`` or
     ``unsigned long int``. It is also usable with ``c`` and ``s`` to specify
     that the arguments will be encoded with ``wchar_t`` values (which isn't
     different from normal ``char`` values). However, this is largely ignored in
     the implementation due to it not being necessary for Python or argument
     decoding (since the argument is always encoded at least as a 32-bit
     integer).
   - ``ll``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be a ``signed long long int`` or
     ``unsigned long long int``. This is required to properly decode the
     argument as a 64-bit integer.
   - ``L``: Usable with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or
     ``G`` conversion specifiers to convey a ``long double`` argument. However,
     this is ignored in the implementation because floating point values are
     encoded in a form that is unaffected by bit width.
   - ``j``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be an ``intmax_t`` or ``uintmax_t``.
   - ``z``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be a ``size_t``. This will force the argument
     to be decoded as an unsigned integer.
   - ``t``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
     to convey the argument will be a ``ptrdiff_t``.
   - If a length modifier is provided for an incorrect specifier, it is ignored.
- Specifier (Required)
   - ``d`` / ``i``: Used for signed decimal integers.

   - ``u``: Used for unsigned decimal integers.

   - ``o``: Used for unsigned integers and specifies formatting should
     be as an octal number.

   - ``x``: Used for unsigned integers and specifies formatting should
     be as a hexadecimal number using all lowercase letters.

   - ``X``: Used for unsigned integers and specifies formatting should
     be as a hexadecimal number using all uppercase letters.

   - ``f``: Used for floating-point values and specifies to use lowercase,
     decimal floating point formatting.

     - Default precision is ``6`` decimal places unless explicitly specified.

   - ``F``: Used for floating-point values and specifies to use uppercase,
     decimal floating point formatting.

     - Default precision is ``6`` decimal places unless explicitly specified.

   - ``e``: Used for floating-point values and specifies to use lowercase,
     exponential (scientific) formatting.

     - Default precision is ``6`` decimal places unless explicitly specified.

   - ``E``: Used for floating-point values and specifies to use uppercase,
     exponential (scientific) formatting.

     - Default precision is ``6`` decimal places unless explicitly specified.

   - ``g``: Used for floating-point values and specifies to use ``f`` or ``e``
     formatting depending on which would be the shortest representation.

     - Precision specifies the number of significant digits, not just digits
       after the decimal place.

     - If the precision is specified as ``0``, it is interpreted to mean ``1``.

     - ``e`` formatting is used if the exponent would be less than ``-4`` or
       is greater than or equal to the precision.

     - Trailing zeros are removed unless the ``#`` flag is set.

     - A decimal point only appears if it is followed by a digit.

     - ``NaN`` or infinities always follow ``f`` formatting.

   - ``G``: Used for floating-point values and specifies to use ``F`` or ``E``
     formatting depending on which would be the shortest representation.

     - Precision specifies the number of significant digits, not just digits
       after the decimal place.

     - If the precision is specified as ``0``, it is interpreted to mean ``1``.

     - ``E`` formatting is used if the exponent would be less than ``-4`` or
       is greater than or equal to the precision.

     - Trailing zeros are removed unless the ``#`` flag is set.

     - A decimal point only appears if it is followed by a digit.

     - ``NaN`` or infinities always follow ``F`` formatting.

   - ``c``: Used for formatting a ``char`` value.

   - ``s``: Used for formatting a string of ``char`` values.

     - If width is specified, the null terminator character is included as a
       character for width count.

     - If precision is specified, no more ``char``\s than that value will be
       written from the string (padding is used to fill additional width).

   - ``p``: Used for formatting a pointer address.

   - ``%``: Prints a single ``%``. Only valid as ``%%`` (supports no flags,
     width, precision, or length modifiers).

Underspecified details:

- If both ``+`` and (space) flags appear, the (space) is ignored.
- The ``+`` and (space) flags will error if used with ``c`` or ``s``.
- The ``#`` flag will error if used with ``d``, ``i``, ``u``, ``c``, ``s``, or
  ``p``.
- The ``0`` flag will error if used with ``c``, ``s``, or ``p``.
- Both ``+`` and (space) can work with the unsigned integer specifiers ``u``,
  ``o``, ``x``, and ``X``.
- If a length modifier is provided for an incorrect specifier, it is ignored.
- The ``z`` length modifier will decode arguments as signed as long as ``d`` or
  ``i`` is used.
- ``p`` is implementation defined.

  - For this implementation, it prints a ``0x`` prefix followed by the pointer
    value formatted with ``%08X``.

  - ``p`` supports the ``+``, ``-``, and (space) flags, but not the ``#`` or
    ``0`` flags.

  - None of the length modifiers are usable with ``p``.

  - This implementation will try to adhere to user-specified width (assuming the
    width provided is larger than the guaranteed minimum of ``10``).

  - Specifying precision for ``p`` is considered an error.
- Only ``%%`` is allowed with no other modifiers. Things like ``%+%`` will fail
  to decode. Some C stdlib implementations accept modifiers between the two
  ``%`` characters but ignore them in the output.
- If a width is specified with the ``0`` flag for a negative value, the padded
  ``0``\s will appear after the ``-`` symbol.
- A precision of ``0`` for ``d``, ``i``, ``u``, ``o``, ``x``, or ``X`` means
  that no character is written for the value ``0``.
- Precision cannot be specified for ``c``.
- Using ``*`` or fixed precision with the ``s`` specifier still requires the
  string argument to be null-terminated. This is due to argument encoding
  happening on the C/C++ side while the precision value is not read or
  otherwise used until decoding happens in this Python code.

Non-conformant details:

- ``n`` specifier: We do not support the ``n`` specifier, since it is impossible
  to retroactively tell the original program how many characters were printed;
  decoding happens long after the device sent the message, usually on a
  separate device entirely.

C++
===
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

TypeScript
==========
To detokenize in TypeScript, import ``Detokenizer`` from the ``pigweedjs``
package, and instantiate it with a CSV token database.

.. code-block:: typescript

   import { pw_tokenizer, pw_hdlc } from 'pigweedjs';
   const { Detokenizer } = pw_tokenizer;
   const { Frame } = pw_hdlc;

   const detokenizer = new Detokenizer(String(tokenCsv));

   function processLog(frame: Frame){
     const result = detokenizer.detokenize(frame);
     console.log(result);
   }

For messages that are encoded in Base64, use ``Detokenizer::detokenizeBase64``.
``detokenizeBase64`` will also attempt to detokenize nested Base64 tokens. There
is also ``detokenizeUint8Array`` that works just like ``detokenize`` but expects
a ``Uint8Array`` instead of a ``Frame`` argument.

Protocol buffers
================
``pw_tokenizer`` provides utilities for handling tokenized fields in protobufs.
See :ref:`module-pw_tokenizer-proto` for details.

.. toctree::
   :hidden:

   proto.rst

-------------
Base64 format
-------------
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_LOG("This is an example: %d!", -1);

    Plain text: This is an example: -1! [23 bytes]

        Binary: 66 6e 01 4b 01          [ 5 bytes]

        Base64: $Zm4BSwE=               [ 9 bytes]
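
The Base64 text can be reproduced from the binary encoding with standard tools,
as in this quick Python check of the example above:

.. code-block:: python

   import base64

   binary = bytes.fromhex('666e014b01')  # Token 0x4b016e66 + varint-encoded -1.
   print('$' + base64.b64encode(binary).decode())  # Prints $Zm4BSwE=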

Encoding
========
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void TokenizedMessageHandler(const uint8_t encoded_message[],
                                size_t size_bytes) {
     pw::InlineBasicString base64 = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes));

     TransmitLogMessage(base64.data(), base64.size());
   }

Decoding
========
The Python ``Detokenizer`` class supports decoding and detokenizing prefixed
Base64 messages with ``detokenize_base64`` and related methods.

.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
   passed as an argument to the printf-style string ``Nested message: %s``, which
   encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
   as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.
1250
Investigating undecoded messages
================================
Tokenized messages cannot be decoded if the token is not recognized. The Python
package includes the ``parse_message`` tool, which parses tokenized Base64
messages without looking up the token in a database. This tool attempts to guess
the types of the arguments and displays potential ways to decode them.

This tool can be used to extract argument information from an otherwise unusable
message. It could help identify which statement in the code produced the
message. This tool is not particularly helpful for tokenized messages without
arguments, since all it can do is show the value of the unknown token.

The tool is executed by passing Base64 tokenized messages, with or without the
``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to
see full usage information.

Example
-------
.. code-block::

   $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

   INF Decoding arguments for '$329JMwA='
   INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
   INF Token:  0x33496fdf
   INF Args:   b'\x00' [00] (1 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF   Attempt 1: [%s]
   INF   Attempt 2: [%d] 0

   INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
   INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
   INF Token:  0xe7a58492
   INF Args:   b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF   Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
   INF   Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

Command line utilities
----------------------
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

--------------------
Deployment war story
--------------------
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
=======
* Log contents shrank by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
===================
* In the project's logging macro, calls to the underlying logging function were
  replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function the log messages were encoded in the $-prefixed `Base64
  format`_, then dispatched as normal log messages.
* Asserts were tokenized using a callback-based API that has since been removed
  (a :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a better
  alternative).

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of lines being added to the database, since every time code moves,
  new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
  numbers are encoded in the log metadata. Line numbers may also be included
  by adding ``"%d"`` to the format string and passing ``__LINE__``.

Database management
===================
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their code
  changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source CSV database.
   Use the `Directory database format`_ instead.

Decoding tooling deployment
===========================
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates (see the sketch below).
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in APKs
     or binaries.

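As an illustration of the second point, here is a minimal sketch of a
continuously-running decoder, assuming a CSV token database at the hypothetical
path ``tokens.csv``:

.. code-block:: python

   from pw_tokenizer.detokenize import AutoUpdatingDetokenizer

   # Reloads the database when 'tokens.csv' changes on disk, so a
   # long-running tool picks up new tokens without restarting.
   detokenizer = AutoUpdatingDetokenizer('tokens.csv')

   def decode_log_line(line: str) -> str:
       # Replaces any $-prefixed Base64 tokens found in the line.
       return detokenizer.detokenize_base64(line)
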
---------------------------
Limitations and future work
---------------------------

GCC bug: tokenization in template functions
===========================================
GCC incorrectly ignores the section attribute for template `functions
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and `variables
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. For example, the
following won't work when compiling with GCC and tokenized logging:

.. code-block:: cpp

   template <...>
   void DoThings() {
     int value = GetValue();
     // This log won't work with tokenized logs due to the templated context.
     PW_LOG_INFO("Got value: %d", value);
     ...
   }

The bug causes tokenized strings in template functions to be emitted into
``.rodata`` instead of the special tokenized string section. This causes two
problems:

1. Tokenized strings will not be discovered by the token database tools.
2. Tokenized strings may not be removed from the final binary.

There are two workarounds.

#. **Use Clang.** Clang puts the string data in the requested section, as
   expected. No extra steps are required.

#. **Move tokenization calls to a non-templated context.** Creating a separate
   non-templated function and invoking it from the template resolves the issue.
   This enables tokenization in most cases encountered in practice with
   templates.

   .. code-block:: cpp

      // In .h file:
      void LogThings(int value);

      template <...>
      void DoThings() {
        int value = GetValue();
        // This log will work: it calls the non-templated helper.
        LogThings(value);
        ...
      }

      // In .cc file:
      void LogThings(int value) {
        // Tokenized logging works as expected in this non-templated context.
        PW_LOG_INFO("Got value %d", value);
      }

A third option, which isn't implemented yet, is to compile the binary twice:
once to extract the tokens, and once for the production binary (without
tokens). If this approach is interesting to you, please get in touch.

64-bit tokenization
===================
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
=======================
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
=====================================
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_STRING("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support
a small number of arguments.

-------------
Compatibility
-------------
* C11
* C++14
* Python 3

------------
Dependencies
------------
* ``pw_varint`` module
* ``pw_preprocessor`` module
* ``pw_span`` module