.. _module-pw_tokenizer:

============
pw_tokenizer
============
:bdg-primary:`host`
:bdg-primary:`device`
:bdg-secondary:`Python`
:bdg-secondary:`C++`
:bdg-secondary:`TypeScript`
:bdg-success:`stable`

Logging is critical, but developers are often forced to choose between
additional logging and saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
   This usage of the term "tokenizer" is not related to parsing! The module is
   called tokenizer because it replaces a whole string literal with an integer
   token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

* Dramatically reduce binary size by removing string literals from binaries.
* Reduce I/O traffic, RAM, and flash usage by sending and storing compact
  tokens instead of strings. We've seen over 50% reduction in encoded log
  contents.
* Reduce CPU usage by replacing ``snprintf`` calls with simple tokenization
  code.
* Remove potentially sensitive log, assert, and other strings from binaries.

--------------
Basic overview
--------------
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

* **Tokenization** converts string literals in the source code to binary
  tokens at compile time. If the string has printf-style arguments, these are
  encoded to compact binary form at runtime.
* **Detokenization** converts tokenized strings back to the original
  human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

1. During compilation, the ``pw_tokenizer`` module hashes string literals to
   generate stable 32-bit tokens.
2. The tokenization macro removes these strings by declaring them in an ELF
   section that is excluded from the final binary.
3. After compilation, strings are extracted from the ELF to build a database
   of tokenized strings for use by the detokenizer. The ELF file may also be
   used directly.
4. During operation, the device encodes the string token and its arguments, if
   any.
5. The encoded tokenized strings are sent off-device or stored.
6. Off-device, the detokenizer tools use the token database to decode the
   strings to human-readable form.

Example: tokenized logging
==========================
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).

**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ========== | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ---------- |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ========== |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

---------------
Getting started
---------------
Integrating ``pw_tokenizer`` requires a few steps beyond building the code.
This section describes one way ``pw_tokenizer`` might be integrated with a
project. These steps can be adapted as needed.

1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
   are provided. For Make or other build systems, add the files specified in
   the ``pw_tokenizer`` target of ``BUILD.gn`` to the build.
2. Use the tokenization macros in your code. See `Tokenization`_.
3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
   linker script. In GN and CMake, this step is done automatically.
4. Compile your code to produce an ELF file.
5. Run ``database.py create`` on the ELF file to generate a CSV token
   database. See `Managing token databases`_.
6. Commit the token database to your repository.
   See notes in `Database management`_.
7. Integrate a ``database.py add`` command to your build to automatically
   update the committed token database. In GN, use the
   ``pw_tokenizer_database`` template to do this. See `Update a database`_.
8. Integrate ``detokenize.py`` or the C++ detokenization library with your
   tools to decode tokenized logs. See `Detokenization`_.

Using with Zephyr
=================
When building ``pw_tokenizer`` with Zephyr, three Kconfig options can
currently be used, as shown in the sample ``prj.conf`` after this list:

* ``CONFIG_PIGWEED_TOKENIZER`` will automatically link ``pw_tokenizer`` as
  well as any dependencies.
* ``CONFIG_PIGWEED_TOKENIZER_BASE64`` will automatically link
  ``pw_tokenizer.base64`` as well as any dependencies.
* ``CONFIG_PIGWEED_DETOKENIZER`` will automatically link
  ``pw_tokenizer.decoder`` as well as any dependencies.
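
For example, a minimal ``prj.conf`` sketch that enables tokenization and
detokenization support (enable only the options your project needs):

.. code-block:: cfg

   CONFIG_PIGWEED_TOKENIZER=y
   CONFIG_PIGWEED_DETOKENIZER=y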

Once enabled, the tokenizer headers can be included like any other Zephyr
header:

.. code-block:: cpp

   #include <pw_tokenizer/tokenize.h>

.. note::
   Zephyr handles the additional linker sections via
   ``pw_tokenizer_linker_rules.ld``, which is added to the end of the linker
   file via a call to ``zephyr_linker_sources(SECTIONS ...)``.

------------
Tokenization
------------
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization
can be sent off device or stored in place of a full string.

.. doxygentypedef:: pw_tokenizer_Token

Tokenization macros
===================
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
-------------------------
``pw_tokenizer`` provides macros for tokenizing string literals with no
arguments.

.. doxygendefine:: PW_TOKENIZE_STRING
.. doxygendefine:: PW_TOKENIZE_STRING_DOMAIN
.. doxygendefine:: PW_TOKENIZE_STRING_MASK

The tokenization macros above cannot be used inside other expressions.

.. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable.
   :class: checkmark

   .. code:: cpp

      constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");

      void Function() {
        constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
      }

.. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression.
   :class: error

   .. code:: cpp

      void BadExample() {
        ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
      }

   Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead.

An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++, cannot be assigned to ``constexpr`` variables, and cannot be used
with special function variables like ``__func__``.

.. doxygendefine:: PW_TOKENIZE_STRING_EXPR
.. doxygendefine:: PW_TOKENIZE_STRING_DOMAIN_EXPR
.. doxygendefine:: PW_TOKENIZE_STRING_MASK_EXPR

.. admonition:: When to use these macros

   Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string
   literals that do not need %-style arguments encoded.

.. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions.
   :class: checkmark

   .. code:: cpp

      void GoodExample() {
        ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
      }

.. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable.
   :class: error

   .. code:: cpp

      constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");

   Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr``
   variable.

.. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`.
   :class: error

   .. code:: cpp

      void BadExample() {
        // This compiles, but __func__ will not be the outer function's name,
        // and there may be compiler warnings.
        constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
      }

   Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or
   similar macros.

.. _module-pw_tokenizer-custom-macro:

Tokenize a message with arguments in a custom macro
---------------------------------------------------
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use ``pw_tokenizer`` is to pass
tokenized data to a global handler function. A project's custom tokenization
macro can handle tokenized data in a function of its choosing.

``pw_tokenizer`` provides two low-level macros for projects to use to create
custom tokenization macros.

.. doxygendefine:: PW_TOKENIZE_FORMAT_STRING
.. doxygendefine:: PW_TOKENIZER_ARG_TYPES

The outputs of these macros are typically passed to an encoding function. That
function encodes the token, argument types, and argument data to a buffer
using helpers provided by ``pw_tokenizer/encode_args.h``.

.. doxygenfunction:: pw::tokenizer::EncodeArgs
.. doxygenclass:: pw::tokenizer::EncodedMessage
   :members:
.. doxygenfunction:: pw_tokenizer_EncodeArgs

Example
^^^^^^^
The following example implements a custom tokenization macro similar to
:ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(uint32_t metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)         \
     do {                                                                 \
       PW_TOKENIZE_FORMAT_STRING(                                         \
           PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(metadata,                                   \
                              _pw_tokenizer_token,                        \
                              PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)         \
                                  PW_COMMA_ARGS(__VA_ARGS__));            \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
:cpp:class:`pw::tokenizer::EncodedMessage` class or
:cpp:func:`pw::tokenizer::EncodeArgs` function from
``pw_tokenizer/encode_args.h``. The encoded message can then be transmitted or
stored as needed.

.. code-block:: cpp

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               pw::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const uint32_t metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage<> encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: Why use a custom macro

   - Optimal code size. Invoking a free function with the tokenized data
     results in the smallest possible call site.
   - Pass additional arguments, such as metadata, with the tokenized message.
   - Integrate ``pw_tokenizer`` with other systems.

Tokenize a message with arguments to a buffer
---------------------------------------------
.. doxygendefine:: PW_TOKENIZE_TO_BUFFER
.. doxygendefine:: PW_TOKENIZE_TO_BUFFER_DOMAIN
.. doxygendefine:: PW_TOKENIZE_TO_BUFFER_MASK

.. admonition:: Why use this macro

   - Encode a tokenized message for consumption within a function.
   - Encode a tokenized message into an existing buffer.

   Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a
   logging macro, because it will result in larger code size than passing the
   tokenized data to a function.
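
As an illustrative sketch, the following encodes a message to a stack buffer
and hands it to a hypothetical transport function; the buffer size here is
arbitrary. ``PW_TOKENIZE_TO_BUFFER`` updates the size argument to the number
of bytes written.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   #include "pw_tokenizer/tokenize.h"

   // Hypothetical transport; not part of pw_tokenizer.
   void SendPacket(const uint8_t* data, size_t size);

   void LogBatteryVoltage(int voltage_mv) {
     uint8_t buffer[32];  // Arbitrary size for this sketch.
     size_t size_bytes = sizeof(buffer);

     // Encodes the format string's token and the varint-encoded argument into
     // buffer; size_bytes is updated to the number of bytes written.
     PW_TOKENIZE_TO_BUFFER(
         buffer, &size_bytes, "Battery voltage: %d mV", voltage_mv);

     SendPacket(buffer, size_bytes);
   }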

Binary logging with pw_tokenizer
================================
String tokenization can be used to convert plain text logs to a compact,
efficient binary format. See :ref:`module-pw_log_tokenized`.

Tokenizing function names
=========================
The string literal tokenization functions support tokenizing string literals
or constexpr character arrays (``constexpr const char[]``). In GCC and Clang,
the special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are
declared as ``static constexpr char[]`` in C++ instead of the standard
``static const char[]``. This means that ``__func__`` and
``__PRETTY_FUNCTION__`` can be tokenized while compiling C++ with GCC or
Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
======================
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

This function requires that a string's token has already been calculated.
Typically these tokens are provided by a database, but they can be manually
created using the tokenizer hash.

.. autofunction:: pw_tokenizer.tokens.pw_tokenizer_65599_hash

This is particularly useful for offline token database generation in cases
where tokenized strings in a binary cannot be embedded as parsable
pw_tokenizer entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as
   opposed to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a
   matching hash length limit. When creating an offline database, it's a good
   idea to generate tokens for both, and merge the databases.

Encoding
========
The token is a 32-bit hash calculated during compilation. The string is
encoded little-endian with the token followed by arguments, if any. For
example, the 31-byte string ``You can go about your business.`` hashes to
0xdac9a244. This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows (see the encoding sketch after this list):

* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated.
  The remaining 7 bits encode the string length, with a maximum of 127 bytes.
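
To make the integer encoding concrete, here is a small sketch of ZigZag
encoding. This illustrates the standard technique rather than
``pw_tokenizer``'s exact implementation; the argument-encoding helpers in
``pw_tokenizer/encode_args.h`` are the source of truth.

.. code-block:: cpp

   #include <cstdint>

   // ZigZag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ..., so values near
   // zero of either sign encode to short varints. Assumes arithmetic right
   // shift for negative values, which C++20 guarantees.
   constexpr uint64_t ZigZagEncode(int64_t n) {
     return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
   }

   static_assert(ZigZagEncode(0) == 0);
   static_assert(ZigZagEncode(-1) == 1);
   // 3989 -> 7978 (0x1F2A), which the varint step then encodes as the two
   // bytes aa 3e seen in the battery logging example above.
   static_assert(ZigZagEncode(3989) == 7978);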

.. TODO(hepler): insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as
   an integer instead of a string). See also `Tokenized strings as %s
   arguments`_.

Buffer sizing helper
--------------------
.. doxygenfunction:: pw::tokenizer::MinEncodingBufferSizeBytes
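
As a usage sketch (assuming the template-parameter form documented above), the
helper can size a buffer at compile time for the worst-case encoding of a
message's argument types:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   #include "pw_tokenizer/encode_args.h"
   #include "pw_tokenizer/tokenize.h"

   // Worst-case size for a 4-byte token plus an int and a float argument.
   constexpr size_t kBufferSize =
       pw::tokenizer::MinEncodingBufferSizeBytes<int, float>();

   void EncodeReading(int count, float ratio) {
     uint8_t buffer[kBufferSize];
     size_t size_bytes = sizeof(buffer);
     // The encoded message is guaranteed to fit, so it cannot be truncated.
     PW_TOKENIZE_TO_BUFFER(
         buffer, &size_bytes, "Count: %d, ratio: %f", count, ratio);
   }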

Encoding command line utility
-----------------------------
The ``pw_tokenizer.encode`` command line tool can be used to encode tokenized
strings.

.. code-block:: bash

   python -m pw_tokenizer.encode [-h] FORMAT_STRING [ARG ...]

Example:

.. code-block:: text

   $ python -m pw_tokenizer.encode "There's... %d many of %s!" 2 them
         Raw input: "There's... %d many of %s!" % (2, 'them')
   Formatted input: There's... 2 many of them!
             Token: 0xb6ef8b2d
           Encoded: b'-\x8b\xef\xb6\x04\x04them' (2d 8b ef b6 04 04 74 68 65 6d) [10 bytes]
   Prefixed Base64: $LYvvtgQEdGhlbQ==

See ``--help`` for full usage details.

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used
by the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility
with macros, the hash must be limited to a fixed maximum number of characters.
This value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

In C++, tokenization uses a constexpr function instead of a macro. This
function works with any length of string and has a lower compilation time
impact than the C macros. For consistency, C++ tokenization uses the same hash
algorithm, but the calculated values will differ between C and C++ for strings
longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.
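
For illustration, the hash has the following general shape. This sketch
mirrors the Python reference ``pw_tokenizer_65599_hash`` described earlier;
the in-tree C and C++ implementations are authoritative.

.. code-block:: cpp

   #include <cstdint>
   #include <string_view>

   // The hash is seeded with the string's length; each character is then
   // multiplied by an increasing power of 65599. Arithmetic wraps mod 2^32.
   constexpr uint32_t Hash65599(std::string_view str) {
     constexpr uint32_t k65599 = 65599u;
     uint32_t hash = static_cast<uint32_t>(str.size());
     uint32_t coefficient = k65599;
     for (char ch : str) {
       hash += coefficient * static_cast<uint8_t>(ch);
       coefficient *= k65599;
     }
     return hash;
   }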

.. _module-pw_tokenizer-domains:

Tokenization domains
====================
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to
keep tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects,
this default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

.. _module-pw_tokenizer-masks:

Smaller tokens with masking
===========================
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of
the token directly affects memory usage. In those cases, every bit counts, and
it may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the
macros. For example, the following generates 16-bit tokens and packs them into
an existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token.
See `Token collisions`_.

Masked tokens without arguments may be encoded in fewer bytes. For example,
the 16-bit token ``0x1234`` may be encoded as two little-endian bytes
(``34 12``) rather than four (``34 12 00 00``). The detokenizer tools zero-pad
data smaller than four bytes, as sketched below. Tokens with arguments must
always be encoded as four bytes.
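
For example, a receiver might expand a two-byte token as follows. This is a
hypothetical helper; the bundled detokenizer tools perform this zero-padding
for you.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   // Expands a truncated little-endian token to its full 4-byte form by
   // zero-padding the missing high bytes before database lookup.
   constexpr uint32_t ZeroPadToken(const uint8_t* bytes, size_t size_bytes) {
     uint32_t token = 0;
     for (size_t i = 0; i < size_bytes && i < 4; ++i) {
       token |= static_cast<uint32_t>(bytes[i]) << (8 * i);
     }
     return token;  // e.g. {0x34, 0x12} expands to 0x00001234.
   }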

Token collisions
================
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will
have the same token in the database, and it may not be possible to
unambiguously decode a token.

The detokenization tools attempt to resolve collisions automatically.
Collisions are resolved based on two things:

- whether the tokenized data matches the string's arguments (if any), and
- if / when the string was marked as having been removed from the database.

Working with collisions
-----------------------
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about
a token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
  consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: sh

     python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
-------------------------
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this
is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have
a 1% or 50% probability of at least one collision, assuming a uniform, random
hash; the approximation behind these counts is shown after the table.

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+
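
These counts follow from the standard birthday-problem approximation, where
:math:`b` is the number of token bits and :math:`p` is the probability of at
least one collision among :math:`n` strings:

.. math::

   n \approx \sqrt{2 \cdot 2^{b} \cdot \ln\frac{1}{1 - p}}

For example, with :math:`b = 32` and :math:`p = 0.5`,
:math:`n \approx \sqrt{2^{33} \ln 2} \approx 77000`.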

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of
strings, such as module names, but won't be suitable for large sets of
strings, like log messages.

---------------
Token databases
---------------
Token databases store a mapping of tokens to the strings they represent. An
ELF file can be used as a token database, but it only contains the strings for
its exact build. A token database file aggregates tokens from multiple ELF
files, so that a single database can decode tokenized strings from any known
ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

Token database formats
======================
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the
removal date (if any) in year-month-day format, and the string literal,
surrounded by quotes. Quote characters within the string are represented as
two quote characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5,          ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a,          ,"Hello %s! %hd %e"
   851beeb6,          ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"

Binary database format
----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00           is: %s.%llu.

Directory database format
-------------------------
``pw_tokenizer`` can consume directories of CSV databases. A directory
database is searched recursively for files with a ``.pw_tokenizer.csv``
suffix, all of which are used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   token_database
   ├── chuck_e_cheese.pw_tokenizer.csv
   ├── fungi_ble.pw_tokenizer.csv
   └── some_more
       └── arcade.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source
code. The token database commands randomly generate unique file names for the
CSVs in the database to prevent merge conflicts. Running the ``mark_removed``
or ``purge`` commands in the database CLI consolidates the files to a single
CSV.

The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens: it identifies the latest CSV not present in the
provided ``<upstream_commit>`` and discards any tokens in that CSV that are
not among the newly added tokens. This helps keep temporary tokens (e.g. from
debug logs) out of the database.

JSON support
============
While ``pw_tokenizer`` doesn't specify a JSON database format, a token
database can be created from a JSON-formatted array of strings. This is useful
for side-band token database generation for strings that are not embedded as
parsable tokens in compiled binaries. See
:ref:`module-pw_tokenizer-database-creation` for instructions on generating a
token database from a JSON file like the one sketched below.
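
For illustration, the JSON input is simply an array of the strings to
tokenize; a minimal example file could contain:

.. code-block:: json

   [
     "Hello, world!",
     "Battery state: %s; battery voltage: %d mV"
   ]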

Managing token databases
========================
Token databases are managed with the ``database.py`` script. This script can
be used to extract tokens from compilation artifacts and manage database
files. Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use
that file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
-----------------
The ``create`` command makes a new token database from ELF files (.elf, .o,
.so, etc.), archives (.a), existing token databases (CSV or binary), or a JSON
file containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into a source control
system or for human review. Binary databases are more compact and simpler to
parse. The C++ detokenizer library currently only supports binary databases.

Update a database
-----------------
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the
database. Adding tokens already present in the database updates the date
removed, if any, to the latest.

A CSV token database can be checked into a source repository and updated as
code changes are made. The build system can invoke ``database.py`` to update
the database after each build.

GN integration
--------------
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source
tokenized strings database or creates a new database with artifacts from one
or more GN targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database
with the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be
provided with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps``, or build
   other GN targets first, if this is a concern.

--------------
Detokenization
--------------
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python, C++, and TypeScript
detokenization libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that
   wish to interleave tokenized messages with plain text, using Base64 is a
   worthwhile tradeoff.

Python
======
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization. The
class also supports token domains for the given database files in the
``<path>#<domain>`` format.

For messages that are optionally tokenized and may be encoded as binary,
Base64, or plaintext UTF-8, use
:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to
determine the correct method to detokenize and always provide a printable
string. For more information on this feature, see
:ref:`module-pw_tokenizer-proto`.

C99 ``printf`` Compatibility Notes
----------------------------------
This implementation is designed to align with the
`C99 specification, section 7.19.6
<https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf>`_.
Notably, this specification differs slightly from what most compilers
implement, because each compiler interprets undefined behavior in slightly
different ways. Treat the following description as the source of truth.

This implementation supports:

- Overall Format: ``%[flags][width][.precision][length][specifier]``
- Flags (Zero or More)

  - ``-``: Left-justify within the given field width; right justification is
    the default (see the width modifier).
  - ``+``: Forces the result to be preceded with a plus or minus sign (``+``
    or ``-``) even for positive numbers. By default, only negative numbers
    are preceded with a ``-`` sign.
  - (space): If no sign is going to be written, a blank space is inserted
    before the value.
  - ``#``: Specifies that an alternative print syntax should be used.

    - Used with ``o``, ``x``, or ``X`` specifiers, the value is preceded with
      ``0``, ``0x``, or ``0X``, respectively, for values other than zero.
    - Used with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or ``G``, it
      forces the written output to contain a decimal point even if no more
      digits follow. By default, if no digits follow, no decimal point is
      written.

  - ``0``: Left-pads the number with zeroes (``0``) instead of spaces when
    padding is specified (see the width sub-specifier).

- Width (Optional)

  - ``(number)``: Minimum number of characters to be printed. If the value to
    be printed is shorter than this number, the result is padded with blank
    spaces, or with ``0`` if the ``0`` flag is present. The value is not
    truncated even if the result is larger. If the value is negative and the
    ``0`` flag is present, the ``0``\s are padded after the ``-`` symbol.
  - ``*``: The width is not specified in the format string, but as an
    additional integer value argument preceding the argument that has to be
    formatted.

- Precision (Optional)

  - ``.(number)``

    - For ``d``, ``i``, ``o``, ``u``, ``x``, ``X``, specifies the minimum
      number of digits to be written. If the value to be written is shorter
      than this number, the result is padded with leading zeros. The value is
      not truncated even if the result is longer.

      - A precision of ``0`` means that no character is written for the value
        ``0``.

    - For ``a``, ``A``, ``e``, ``E``, ``f``, and ``F``, specifies the number
      of digits to be printed after the decimal point. By default, this is
      ``6``.
    - For ``g`` and ``G``, specifies the maximum number of significant digits
      to be printed.
    - For ``s``, specifies the maximum number of characters to be printed. By
      default all characters are printed until the ending null character is
      encountered.
    - If the period is specified without an explicit value for precision,
      ``0`` is assumed.

  - ``.*``: The precision is not specified in the format string, but as an
    additional integer value argument preceding the argument that has to be
    formatted.

- Length (Optional)

  - ``hh``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed char`` or ``unsigned char``.
    However, this is largely ignored in the implementation, since it is not
    necessary for Python or argument decoding (the argument is always encoded
    as at least a 32-bit integer).
  - ``h``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed short int`` or
    ``unsigned short int``. However, this is largely ignored in the
    implementation, since it is not necessary for Python or argument decoding
    (the argument is always encoded as at least a 32-bit integer).
  - ``l``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed long int`` or
    ``unsigned long int``. Also usable with ``c`` and ``s`` to specify that
    the arguments will be encoded with ``wchar_t`` values (which isn't
    different from normal ``char`` values). However, this is largely ignored
    in the implementation, since it is not necessary for Python or argument
    decoding (the argument is always encoded as at least a 32-bit integer).
  - ``ll``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed long long int`` or
    ``unsigned long long int``. This is required to properly decode the
    argument as a 64-bit integer.
  - ``L``: Usable with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or
    ``G`` conversion specifiers to indicate a ``long double`` argument.
    However, this is ignored in the implementation, because the encoded
    floating-point value is unaffected by bit width.
  - ``j``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be an ``intmax_t`` or ``uintmax_t``.
  - ``z``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``size_t``. This will force the argument
    to be decoded as an unsigned integer.
  - ``t``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``ptrdiff_t``.
  - If a length modifier is provided for an incorrect specifier, it is
    ignored.

- Specifier (Required)

  - ``d`` / ``i``: Used for signed decimal integers.
  - ``u``: Used for unsigned decimal integers.
  - ``o``: Used for unsigned integers and specifies formatting as an octal
    number.
  - ``x``: Used for unsigned integers and specifies formatting as a
    hexadecimal number using all lowercase letters.
  - ``X``: Used for unsigned integers and specifies formatting as a
    hexadecimal number using all uppercase letters.
  - ``f``: Used for floating-point values and specifies lowercase, decimal
    floating point formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``F``: Used for floating-point values and specifies uppercase, decimal
    floating point formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``e``: Used for floating-point values and specifies lowercase,
    exponential (scientific) formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``E``: Used for floating-point values and specifies uppercase,
    exponential (scientific) formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``g``: Used for floating-point values and specifies ``f`` or ``e``
    formatting, whichever produces the shorter representation.

    - Precision specifies the number of significant digits, not just digits
      after the decimal place.
    - If the precision is specified as ``0``, it is interpreted to mean
      ``1``.
    - ``e`` formatting is used if the exponent would be less than ``-4`` or
      is greater than or equal to the precision.
    - Trailing zeros are removed unless the ``#`` flag is set.
    - A decimal point only appears if it is followed by a digit.
    - ``NaN`` and infinities always follow ``f`` formatting.

  - ``G``: Used for floating-point values and specifies ``F`` or ``E``
    formatting, whichever produces the shorter representation.

    - Precision specifies the number of significant digits, not just digits
      after the decimal place.
    - If the precision is specified as ``0``, it is interpreted to mean
      ``1``.
    - ``E`` formatting is used if the exponent would be less than ``-4`` or
      is greater than or equal to the precision.
    - Trailing zeros are removed unless the ``#`` flag is set.
    - A decimal point only appears if it is followed by a digit.
    - ``NaN`` and infinities always follow ``F`` formatting.

  - ``c``: Used for formatting a ``char`` value.
  - ``s``: Used for formatting a string of ``char`` values.

    - If width is specified, the null terminator character is included as a
      character for the width count.
    - If precision is specified, no more ``char``\s than that value will be
      written from the string (padding is used to fill additional width).

  - ``p``: Used for formatting a pointer address.
  - ``%``: Prints a single ``%``. Only valid as ``%%`` (supports no flags,
    width, precision, or length modifiers).

Underspecified details:

- If both the ``+`` and (space) flags appear, the (space) is ignored.
- The ``+`` and (space) flags will error if used with ``c`` or ``s``.
- The ``#`` flag will error if used with ``d``, ``i``, ``u``, ``c``, ``s``, or
  ``p``.
- The ``0`` flag will error if used with ``c``, ``s``, or ``p``.
- Both ``+`` and (space) can work with the unsigned integer specifiers ``u``,
  ``o``, ``x``, and ``X``.
- If a length modifier is provided for an incorrect specifier, it is ignored.
- The ``z`` length modifier will decode arguments as signed as long as ``d``
  or ``i`` is used.
- ``p`` is implementation defined.

  - This implementation prints a ``0x`` prefix followed by the pointer value
    formatted with ``%08X``.
  - ``p`` supports the ``+``, ``-``, and (space) flags, but not the ``#`` or
    ``0`` flags.
  - None of the length modifiers are usable with ``p``.
  - This implementation will try to adhere to the user-specified width
    (assuming the width provided is larger than the guaranteed minimum of
    ``10``).
  - Specifying precision for ``p`` is considered an error.

- Only ``%%`` is allowed with no other modifiers. Things like ``%+%`` will
  fail to decode. Some C stdlib implementations accept other modifiers
  between the ``%`` characters but ignore them in the output.
- If a width is specified with the ``0`` flag for a negative value, the
  padded ``0``\s will appear after the ``-`` symbol.
- A precision of ``0`` for ``d``, ``i``, ``u``, ``o``, ``x``, or ``X`` means
  that no character is written for the value ``0``.
- Precision cannot be specified for ``c``.
- Using ``*`` or a fixed precision with the ``s`` specifier still requires
  the string argument to be null-terminated. This is because argument
  encoding happens on the C/C++ side, while the precision value is not read
  or otherwise used until decoding happens in this Python code.

Non-conformant details:

- ``n`` specifier: We do not support the ``n`` specifier, since it is
  impossible to retroactively tell the original program how many characters
  were printed; decoding happens long after the device sent the message,
  usually on a separate processing device entirely.

C++
===
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created
with ``database.py create --type binary``). Read a binary format database
from a file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it.
If the data is invalid, ``TokenDatabase::Create`` returns an empty database
for which ``ok()`` returns false. If the token database is included in the
source code, this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use
     // a TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

TypeScript
==========
To detokenize in TypeScript, import ``Detokenizer`` from the ``pigweedjs``
package, and instantiate it with a CSV token database.

.. code-block:: typescript

   import { pw_tokenizer, pw_hdlc } from 'pigweedjs';
   const { Detokenizer } = pw_tokenizer;
   const { Frame } = pw_hdlc;

   const detokenizer = new Detokenizer(String(tokenCsv));

   function processLog(frame: Frame) {
     const result = detokenizer.detokenize(frame);
     console.log(result);
   }

For messages that are encoded in Base64, use
``Detokenizer::detokenizeBase64``. ``detokenizeBase64`` will also attempt to
detokenize nested Base64 tokens. There is also ``detokenizeUint8Array``, which
works just like ``detokenize`` but expects a ``Uint8Array`` instead of a
``Frame`` argument.

Protocol buffers
================
``pw_tokenizer`` provides utilities for handling tokenized fields in
protobufs. See :ref:`module-pw_tokenizer-proto` for details.

.. toctree::
   :hidden:

   proto.rst

-------------
Base64 format
-------------
The tokenizer encodes messages to a compact binary representation.
Applications may desire a textual representation of tokenized strings.
This makes it easy to use tokenized messages alongside plain text messages,
but comes at a small efficiency cost: encoded Base64 messages occupy about
4/3 (133%) as much memory as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_LOG("This is an example: %d!", -1);

    Plain text: This is an example: -1! [23 bytes]

        Binary: 66 6e 01 4b 01          [ 5 bytes]

        Base64: $Zm4BSwE=               [ 9 bytes]

Encoding
========
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or
``pw_tokenizer_PrefixedBase64Encode`` in the tokenizer handler function. For
example,

.. code-block:: cpp

   void TokenizedMessageHandler(const uint8_t encoded_message[],
                                size_t size_bytes) {
     pw::InlineBasicString base64 = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes));

     TransmitLogMessage(base64.data(), base64.size());
   }

Decoding
========
The Python ``Detokenizer`` class supports decoding and detokenizing prefixed
Base64 messages with ``detokenize_base64`` and related methods.

.. tip::
   The Python detokenization tools support recursive detokenization for
   prefixed Base64 text. Tokenized strings found in detokenized text are
   detokenized, so prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could
   be passed as an argument to the printf-style string
   ``Nested message: %s``, which encodes to ``$pEVTYQkkUmhZam1RPT0=``. The
   detokenizer would decode the message as follows:

   ::

      "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or
``pw_tokenizer_PrefixedBase64Decode`` functions, as sketched below.
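
Here is a minimal C++ sketch of the decode direction. The signature is
assumed to mirror the encode example above, taking the prefixed Base64 text
and an output buffer and returning the number of decoded bytes; consult
``pw_tokenizer/base64.h`` for the authoritative API.

.. code-block:: cpp

   #include <cstddef>
   #include <string_view>

   #include "pw_span/span.h"
   #include "pw_tokenizer/base64.h"

   // Hypothetical downstream consumer of the binary-encoded message.
   void HandleBinaryMessage(pw::span<const std::byte> message);

   void DecodePrefixedBase64(std::string_view base64_text) {
     std::byte binary[64];  // Arbitrary size for this sketch.

     // Assumed to return the decoded size in bytes, or 0 on failure.
     size_t decoded_size =
         pw::tokenizer::PrefixedBase64Decode(base64_text, pw::span(binary));

     if (decoded_size > 0u) {
       HandleBinaryMessage(pw::span(binary, decoded_size));
     }
   }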

Investigating undecoded messages
================================
Tokenized messages cannot be decoded if the token is not recognized. The
Python package includes the ``parse_message`` tool, which parses tokenized
Base64 messages without looking up the token in a database. This tool
attempts to guess the types of the arguments and displays potential ways to
decode them.

This tool can be used to extract argument information from an otherwise
unusable message. It could help identify which statement in the code produced
the message. This tool is not particularly helpful for tokenized messages
without arguments, since all it can do is show the value of the unknown
token.

The tool is executed by passing Base64 tokenized messages, with or without
the ``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or
``--help`` to see full usage information.

Example
-------
.. code-block::

   $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

   INF Decoding arguments for '$329JMwA='
   INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
   INF Token: 0x33496fdf
   INF Args: b'\x00' [00] (1 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF Attempt 1: [%s]
   INF Attempt 2: [%d] 0

   INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
   INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
   INF Token: 0xe7a58492
   INF Args: b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
   INF Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

Command line utilities
----------------------
``pw_tokenizer`` provides two standalone command line utilities for
detokenizing Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``serial_detokenizer.py`` -- Detokenizes Base64-encooded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be
executed as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

--------------------
Deployment war story
--------------------
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had
substantial benefits.

Results
=======
* Log contents shrunk by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
===================
* In the project's logging macro, calls to the underlying logging function
  were replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function, the log messages were encoded in the $-prefixed
  `Base64 format`_, then dispatched as normal log messages.
.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
   passed as an argument to the printf-style string ``Nested message: %s``, which
   encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
   as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.

Investigating undecoded messages
================================
Tokenized messages cannot be decoded if the token is not recognized. The Python
package includes the ``parse_message`` tool, which parses tokenized Base64
messages without looking up the token in a database. This tool attempts to guess
the types of the arguments and displays potential ways to decode them.

This tool can be used to extract argument information from an otherwise unusable
message. It could help identify which statement in the code produced the
message. This tool is not particularly helpful for tokenized messages without
arguments, since all it can do is show the value of the unknown token.

The tool is executed by passing Base64 tokenized messages, with or without the
``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to
see full usage information.

Example
-------
.. code-block::

   $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

   INF Decoding arguments for '$329JMwA='
   INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
   INF Token: 0x33496fdf
   INF Args: b'\x00' [00] (1 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF Attempt 1: [%s]
   INF Attempt 2: [%d] 0

   INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
   INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
   INF Token: 0xe7a58492
   INF Args: b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
   INF Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

Command line utilities
----------------------
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block::

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

--------------------
Deployment war story
--------------------
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
=======
* Log contents shrunk by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
===================
* In the project's logging macro, calls to the underlying logging function were
  replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function the log messages were encoded in the $-prefixed `Base64
  format`_, then dispatched as normal log messages.
* Asserts were tokenized using a callback-based API that has since been removed
  (a :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a better
  alternative).

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
   numbers are encoded in the log metadata. Line numbers may also be included
   by adding ``"%d"`` to the format string and passing ``__LINE__``.

Database management
===================
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their code
  changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source CSV database.
   Use the `Directory database format`_ instead.

Decoding tooling deployment
===========================
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates (see the sketch after this tip).
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in APKs
     or binaries.
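As an example of the ``AutoUpdatingDetokenizer`` suggestion above, a
continuously-running wrapper might look like this minimal sketch. It reads
lines from stdin and decodes any prefixed Base64 messages; the database path is
a placeholder for illustration.

.. code-block:: python

   import sys

   from pw_tokenizer import detokenize

   # AutoUpdatingDetokenizer reloads the database when the file changes, so
   # this tool doesn't need to be restarted after a build updates the tokens.
   detokenizer = detokenize.AutoUpdatingDetokenizer("out/tokens.csv")

   for line in sys.stdin.buffer:
       # Replace prefixed Base64 messages with their decoded text;
       # unrecognized tokens pass through unchanged.
       sys.stdout.buffer.write(detokenizer.detokenize_base64(line))
       sys.stdout.buffer.flush()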
---------------------------
Limitations and future work
---------------------------

GCC bug: tokenization in template functions
===========================================
GCC incorrectly ignores the section attribute for template `functions
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and `variables
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. For example, the
following won't work when compiling with GCC and tokenized logging:

.. code-block:: cpp

   template <...>
   void DoThings() {
     int value = GetValue();
     // This log won't work with tokenized logs due to the templated context.
     PW_LOG_INFO("Got value: %d", value);
     ...
   }

The bug causes tokenized strings in template functions to be emitted into
``.rodata`` instead of the special tokenized string section. This causes two
problems:

1. Tokenized strings will not be discovered by the token database tools.
2. Tokenized strings may not be removed from the final binary.

There are two workarounds.

#. **Use Clang.** Clang puts the string data in the requested section, as
   expected. No extra steps are required.

#. **Move tokenization calls to a non-templated context.** Creating a separate
   non-templated function and invoking it from the template resolves the issue.
   This enables tokenizing in most cases encountered in practice with
   templates.

   .. code-block:: cpp

      // In .h file:
      void LogThings(int value);

      template <...>
      void DoThings() {
        int value = GetValue();
        // This log will work: it calls the non-templated helper.
        LogThings(value);
        ...
      }

      // In .cc file:
      void LogThings(int value) {
        // Tokenized logging works as expected in this non-templated context.
        PW_LOG_INFO("Got value %d", value);
      }

A third option, not yet implemented, is to compile the binary twice: once to
extract the tokens, and once for the production binary (without tokens). If
this would be useful to you, please get in touch.

64-bit tokenization
===================
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple: add an option that switches the 32-bit types to 64-bit. The tokenizer
stores the sizes of these types in the ``.pw_tokenizer.info`` ELF section, so
the sizes of these types can be verified by checking the ELF file, if
necessary.
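To inspect that section, the raw bytes can be extracted from the ELF file. The
sketch below uses the third-party ``pyelftools`` package and a placeholder
firmware path; since the section's layout is not described here, it simply
dumps the contents for manual inspection.

.. code-block:: python

   from elftools.elf.elffile import ELFFile  # Provided by pyelftools

   # "firmware.elf" is a placeholder path to a tokenized binary.
   with open("firmware.elf", "rb") as fd:
       section = ELFFile(fd).get_section_by_name(".pw_tokenizer.info")
       if section is None:
           print("No .pw_tokenizer.info section found")
       else:
           print(section.data().hex(" "))  # Dump raw bytes for inspection.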
Tokenization in headers
=======================
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That is
because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
=====================================
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized
string literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_STRING("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings could
instead be prefixed Base64-encoded and sent as ``%s``. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support
a small number of arguments.

-------------
Compatibility
-------------
* C11
* C++14
* Python 3

------------
Dependencies
------------
* ``pw_varint`` module
* ``pw_preprocessor`` module
* ``pw_span`` module