.. _module-pw_tokenizer:

------------
pw_tokenizer
------------
Logging is critical, but developers are often forced to choose between
additional logging or saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.

.. note::
   This usage of the term "tokenizer" is not related to parsing! The module is
   called tokenizer because it replaces a whole string literal with an integer
   token. It does not parse strings into separate tokens.

The most common application of ``pw_tokenizer`` is binary logging, and it is
designed to integrate easily into existing logging systems. However, the
tokenizer is general purpose and can be used to tokenize any strings, with or
without printf-style arguments.

**Why tokenize strings?**

  * Dramatically reduce binary size by removing string literals from binaries.
  * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
    tokens instead of strings. We've seen over 50% reduction in encoded log
    contents.
  * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
  * Remove potentially sensitive log, assert, and other strings from binaries.

Basic overview
==============
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.

  * **Tokenization** converts string literals in the source code to binary
    tokens at compile time. If the string has printf-style arguments, these
    are encoded to compact binary form at runtime.
  * **Detokenization** converts tokenized strings back to the original
    human-readable strings.

Here's an overview of what happens when ``pw_tokenizer`` is used:

  1. During compilation, the ``pw_tokenizer`` module hashes string literals to
     generate stable 32-bit tokens.
  2. The tokenization macro removes these strings by declaring them in an ELF
     section that is excluded from the final binary.
  3. After compilation, strings are extracted from the ELF to build a database
     of tokenized strings for use by the detokenizer. The ELF file may also be
     used directly.
  4. During operation, the device encodes the string token and its arguments,
     if any.
  5. The encoded tokenized strings are sent off-device or stored.
  6. Off-device, the detokenizer tools use the token database to decode the
     strings to human-readable form.

Example: tokenized logging
--------------------------
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
size (49 → 15 bytes).
**Before**: plain text logging

+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

**After**: tokenized logging

+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ========== | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ---------- |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as         |         |
|                  |                                                varint     |         |
|                  | =============== ============================== ========== |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

Getting started
===============
Integrating ``pw_tokenizer`` requires a few steps beyond building the code.
This section describes one way ``pw_tokenizer`` might be integrated with a
project. These steps can be adapted as needed.

  1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
     are provided. For Make or other build systems, add the files specified in
     ``BUILD.gn``'s ``pw_tokenizer`` target to the build.
  2. Use the tokenization macros in your code. See `Tokenization`_.
  3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
     linker script. In GN and CMake, this step is done automatically.
  4. Compile your code to produce an ELF file.
  5. Run ``database.py create`` on the ELF file to generate a CSV token
     database. See `Managing token databases`_.
  6. Commit the token database to your repository. See notes in `Database
     management`_.
  7. Integrate a ``database.py add`` command into your build to automatically
     update the committed token database. In GN, use the
     ``pw_tokenizer_database`` template to do this. See `Update a database`_.
  8. Integrate ``detokenize.py`` or the C++ detokenization library with your
     tools to decode tokenized logs. See `Detokenization`_.

Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization
can be sent off device or stored in place of a full string.

Tokenization macros
-------------------
Adding tokenization to a project is simple. To tokenize a string, include
``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.

Tokenize a string literal
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
   %-style arguments.

Tokenize to a handler function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization macro,
since it takes the fewest arguments. It encodes a tokenized string to a
buffer on the stack. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
C-linkage function.

.. code-block:: cpp

   PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes);

``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
``uintptr_t`` argument to the global handler function. Values like a log level
can be packed into the ``uintptr_t``.

This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
facade. The backend for this facade must define the
``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.

.. code-block:: cpp

   PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                              format_string_literal,
                                              arguments...);

   void pw_tokenizer_HandleEncodedMessageWithPayload(
       uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);

.. admonition:: When to use these macros

   Use these macros anytime a global handler is sufficient, particularly for
   widely expanded macros, like a logging macro.
   ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` and
   ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient
   macros for tokenizing printf-style strings.
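For example, a minimal backend for the ``pw_tokenizer:global_handler`` facade
might simply forward each encoded message to the project's transport. This is
an illustrative sketch; ``WriteLogPacket`` is a hypothetical project-provided
function, not part of ``pw_tokenizer``.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   // Hypothetical project-specific transport, assumed to exist elsewhere.
   void WriteLogPacket(const uint8_t* data, size_t size);

   // Receives each message encoded by PW_TOKENIZE_TO_GLOBAL_HANDLER.
   extern "C" void pw_tokenizer_HandleEncodedMessage(
       const uint8_t encoded_message[], size_t size_bytes) {
     WriteLogPacket(encoded_message, size_bytes);
   }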
Tokenize to a callback
^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided
at the call site. The size of the buffer is set with
``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.

.. code-block:: cpp

   PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
   use for another purpose or more flexibility is needed.
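As a concrete sketch, the callback may be any function with the expected
signature. Here, a hypothetical ``SendOverUart`` function transmits encoded
assert messages separately from the log stream.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   // Hypothetical transport for assert messages, assumed to exist elsewhere.
   void SendOverUart(const uint8_t* buffer, size_t buffer_size);

   void CheckBatteryVoltage(int voltage_mv) {
     if (voltage_mv < 3000 || voltage_mv > 4500) {
       // Encodes the token and argument to a stack buffer, then invokes
       // SendOverUart with the encoded message.
       PW_TOKENIZE_TO_CALLBACK(
           SendOverUart, "Battery voltage out of range: %d mV", voltage_mv);
     }
   }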
Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which
encodes to a caller-provided buffer.

.. code-block:: cpp

   uint8_t buffer[BUFFER_SIZE];
   size_t size_bytes = sizeof(buffer);
   PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);

While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
than the other macros, so its per-use code size overhead is larger.

.. admonition:: When to use this macro

   Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
   other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
   widely expanded macros, such as a logging macro, because it will result in
   larger code size than its alternatives.
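For instance, the following sketch encodes to a small stack buffer and stores
the result. Note that ``size_bytes`` is both an input and an output: it starts
as the buffer capacity and is updated to the number of bytes actually written.
``StoreEncodedMessage`` is a hypothetical project function.

.. code-block:: cpp

   #include <cstdint>

   #include "pw_tokenizer/tokenize.h"

   // Hypothetical storage function, assumed to exist elsewhere.
   void StoreEncodedMessage(const uint8_t* data, size_t size);

   void RecordTemperature(int temperature_c) {
     uint8_t buffer[32];
     size_t size_bytes = sizeof(buffer);  // In: capacity; out: bytes written.
     PW_TOKENIZE_TO_BUFFER(
         buffer, &size_bytes, "Temperature: %d C", temperature_c);
     StoreEncodedMessage(buffer, size_bytes);
   }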
.. _module-pw_tokenizer-custom-macro:

Tokenize with a custom macro
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Projects may need more flexibility than the standard ``pw_tokenizer`` macros
provide. To support this, projects may define custom tokenization macros. This
requires the use of two low-level ``pw_tokenizer`` macros:

.. c:macro:: PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)

   Tokenizes a format string and sets the ``_pw_tokenizer_token`` variable to
   the token. Must be used in its own scope, since the same variable is used
   in every invocation.

   The tokenized string uses the specified :ref:`tokenization domain
   <module-pw_tokenizer-domains>`. Use ``PW_TOKENIZER_DEFAULT_DOMAIN`` for the
   default. The token also may be masked; use ``UINT32_MAX`` to keep all bits.

.. c:macro:: PW_TOKENIZER_ARG_TYPES(...)

   Converts a series of arguments to a compact format that replaces the format
   string literal.

Use these two macros within the custom tokenization macro to call a function
that does the encoding. The following example implements a custom tokenization
macro for use with :ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(pw_tokenizer_Payload metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)         \
     do {                                                                 \
       PW_TOKENIZE_FORMAT_STRING(                                         \
           PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(metadata,                                   \
                              _pw_tokenizer_token,                        \
                              PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)         \
                                  PW_COMMA_ARGS(__VA_ARGS__));            \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
``pw::tokenizer::EncodedMessage`` class or ``pw::tokenizer::EncodeArgs``
function from ``pw_tokenizer/encode_args.h``. The encoded message can then be
transmitted or stored as needed.

.. code-block:: cpp

   #include <cstdarg>

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               std::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const pw_tokenizer_Payload metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: When to use a custom macro

   Use existing tokenization macros whenever possible. A custom macro may be
   needed to support use cases like the following:

   * Variations of ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` that take
     different arguments.
   * Supporting global handler macros that use different handler functions.

Binary logging with pw_tokenizer
--------------------------------
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the
``RecordLog`` function, which formats the log string, collects a timestamp,
and transmits the result.

.. code-block:: cpp

   #define LOG_INFO(format, ...) \
       RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

   void RecordLog(LogLevel level, const char* file, int line,
                  const char* format, ...) {
     if (level < current_log_level) {
       return;
     }

     char buffer[256];  // Buffer for formatting the log text.
     int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

     va_list args;
     va_start(args, format);
     bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
     va_end(args);

     TransmitLog(TimeSinceBootMillis(), buffer, bytes);
   }

It is trivial to convert this to a binary log using the tokenizer. The
``RecordLog`` call is replaced with a
``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
timestamp and transmits the message with ``TransmitLog``.

.. code-block:: cpp

   #define LOG_INFO(format, ...)                   \
       PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
           (pw_tokenizer_Payload)LogLevel_INFO,    \
           __FILE_NAME__ ":%d " format,            \
           __LINE__,                               \
           __VA_ARGS__)

   extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
       uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
     if (static_cast<LogLevel>(level) >= current_log_level) {
       TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
     }
   }

Note that the ``__FILE_NAME__`` string is directly included in the log format
string. Since the string is tokenized, this has no effect on binary size. A
``%d`` for the line number is added to the format string, so that changing the
line of the log message does not generate a new token. There is no overhead
for additional tokens, but it may not be desirable to fill a token database
with duplicate log lines.

Tokenizing function names
-------------------------
The string literal tokenization functions support tokenizing string literals
or constexpr character arrays (``constexpr const char[]``). In GCC and Clang,
the special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are
declared as ``static constexpr char[]`` in C++ instead of the standard
``static const char[]``. This means that ``__func__`` and
``__PRETTY_FUNCTION__`` can be tokenized while compiling C++ with GCC or
Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

   // Tokenize the function name variables to a handler function.
   PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__);
   PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the ``encode_token_and_args`` function.

.. autofunction:: pw_tokenizer.encode.encode_token_and_args

Encoding
--------
The token is a 32-bit hash calculated during compilation. A tokenized message
is encoded as the token in little-endian byte order, followed by its
arguments, if any. For example, the 31-byte string
``You can go about your business.`` hashes to 0xdac9a244. This is encoded as
4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

  * **Integers** (1--10 bytes) --
    `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
    similarly to Protocol Buffers. Smaller values take fewer bytes.
  * **Floating point numbers** (4 bytes) -- Single precision floating point.
  * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
    The top bit of the length byte indicates whether the string was truncated.
    The remaining 7 bits encode the string length, with a maximum of 127
    bytes.
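To make the integer encoding concrete, the following sketch shows ZigZag
encoding, which maps signed values to unsigned values so that small magnitudes
encode to few varint bytes. With this scheme, the ``3989`` argument from the
logging example above maps to 7978, which varint-encodes to the two bytes
``aa 3e``. This is an illustrative sketch; the real implementation is provided
by the ``pw_varint`` module.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   // ZigZag maps small signed values to small unsigned values:
   // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   constexpr uint32_t ZigZagEncode(int32_t value) {
     return (static_cast<uint32_t>(value) << 1) ^
            static_cast<uint32_t>(value >> 31);
   }

   // Varint encoding: 7 bits per byte, with the top bit set on every byte
   // except the last. Returns the number of bytes written (1 to 5).
   inline size_t VarintEncode(uint32_t value, uint8_t* output) {
     size_t written = 0;
     do {
       uint8_t byte = value & 0x7f;
       value >>= 7;
       output[written++] = (value == 0) ? byte : (byte | 0x80);
     } while (value != 0);
     return written;
   }

   // ZigZagEncode(3989) == 7978; VarintEncode(7978, out) writes {0xaa, 0x3e}.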
.. TODO: insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as
   an integer instead of a string). See also `Tokenized strings as %s
   arguments`_.

Token generation: fixed length hashing at compile time
------------------------------------------------------
String tokens are generated using a modified version of the x65599 hash used
by the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility
with macros, the hash must be limited to a fixed maximum number of characters.
This value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

C++ hashing uses a constexpr function instead of a macro. This function works
with any length of string and has a lower compilation time impact than the C
macros. For consistency, C++ tokenization uses the same hash algorithm, but
the calculated values will differ between C and C++ for strings longer than
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.
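For illustration, a constexpr hash of this general shape can be evaluated
entirely at compile time. This is a sketch only, not the exact
``pw_tokenizer`` implementation; consult the module's hash headers for the
authoritative algorithm and constants.

.. code-block:: cpp

   #include <cstdint>
   #include <string_view>

   // Illustrative 65599-based rolling hash, evaluated at compile time. A
   // fixed-length variant of the same calculation can be expressed as a C
   // preprocessor macro, which is why the C hash stops after
   // PW_TOKENIZER_CFG_C_HASH_LENGTH characters.
   constexpr uint32_t ExampleHash65599(std::string_view string) {
     uint32_t hash = static_cast<uint32_t>(string.size());
     uint32_t coefficient = 65599u;
     for (char ch : string) {
       hash += coefficient * static_cast<uint8_t>(ch);
       coefficient *= 65599u;
     }
     return hash;
   }

   // The result is a compile-time constant.
   constexpr uint32_t kToken = ExampleHash65599("Any string literal!");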
.. _module-pw_tokenizer-domains:

Tokenization domains
--------------------
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to
keep tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects,
this default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See `Managing token databases`_ for information about the ``database.py``
command line tool.

Smaller tokens with masking
---------------------------
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of
the token directly affects memory usage. In those cases, every bit counts, and
it may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the
macros. For example, the following generates 16-bit tokens and packs them into
an existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token.
See `Token collisions`_.

Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will
have the same token in the database, and it may not be possible to
unambiguously decode a token.

The detokenization tools attempt to resolve collisions automatically.
Collisions are resolved based on two things:

  - whether the tokenized data matches the string's arguments (if any), and
  - if / when the string was marked as having been removed from the database.

Working with collisions
^^^^^^^^^^^^^^^^^^^^^^^
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about
a token database, including any collisions.

If there are collisions, take the following steps to resolve them.

  - Change one of the colliding strings slightly to give it a new token.
  - In C (not C++), artificial collisions may occur if strings longer than
    ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
    consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
    See ``pw_tokenizer/public/pw_tokenizer/config.h``.
  - Run the ``mark_removed`` command with the latest version of the build
    artifacts to mark missing strings as removed. This deprioritizes them in
    collision resolution.

    .. code-block:: sh

       python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

    The ``purge`` command may be used to delete these tokens from the
    database.

Probability of collisions
^^^^^^^^^^^^^^^^^^^^^^^^^
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this
is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have
a 1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       | 50%                | 1%               |
+=======+====================+==================+
| 32    | 77000              | 9300             |
+-------+--------------------+------------------+
| 31    | 54000              | 6600             |
+-------+--------------------+------------------+
| 24    | 4800               | 580              |
+-------+--------------------+------------------+
| 16    | 300                | 36               |
+-------+--------------------+------------------+
| 8     | 19                 | 3                |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see `Smaller tokens with
masking`_). 16 bits might be acceptable when tokenizing a small set of
strings, such as module names, but won't be suitable for large sets of
strings, like log messages.
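These string counts follow from the standard birthday problem approximation.
For ``b``-bit tokens and a target probability ``p`` of at least one collision,
the number of strings ``n`` that can be hashed is approximately:

.. math::

   n \approx \sqrt{2^{b+1} \ln\frac{1}{1 - p}}

For example, for 32-bit tokens and a 50% collision probability, this gives
:math:`n \approx \sqrt{2^{33} \ln 2} \approx 77000`, which matches the first
row of the table.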
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An
ELF file can be used as a token database, but it only contains the strings for
its exact build. A token database file aggregates tokens from multiple ELF
files, so that a single database can decode tokenized strings from any known
ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string. Two token database formats are supported: CSV and binary.

CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the
removal date (if any) in year-month-day format, and the string literal,
surrounded by quotes. Quote characters within the string are represented as
two quote characters.

This example database contains six strings, three of which have removal dates.

.. code-block::

   141c35d5, ,"The answer: ""%s"""
   2e668cd6,2019-12-25,"Jello, world!"
   7b940e2a, ,"Hello %s! %hd %e"
   851beeb6, ,"%u %d"
   881436a0,2020-01-01,"The answer is: %s"
   e13b0f94,2020-04-01,"%llu"

Binary database format
----------------------
The binary database format is comprised of a 16-byte header followed by a
series of 8-byte entries. Each entry stores the token and the removal date,
which is 0xFFFFFFFF if there is none. The string literals are stored next in
the same order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the CSV database is shown below. It contains the same
information, but in a more compact and easily processed form. It takes 141 B
compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00           is: %s.%llu.

Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can
be used to extract tokens from compilation artifacts and manage database
files. Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use
that file to experiment with the ``database.py`` commands.

Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o,
.so, etc.), archives (.a), or existing token databases (CSV or binary).

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database formats are supported: CSV and binary. Provide ``--type binary``
to ``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control or for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library only supports binary databases currently.

Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

A CSV token database can be checked into a source repository and updated as
code changes are made. The build system can invoke ``database.py`` to update
the database after each build.
GN integration
^^^^^^^^^^^^^^
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source
tokenized strings database or creates a new database with artifacts from one
or more GN targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database
with the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be
provided with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_paths`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps``, or build
   other GN targets first, if this is a concern.
Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.

**Example: decoding tokenized logs**

A project might tokenize its log messages with the `Base64 format`_. Consider
the following log file, which has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block::

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
   much space as the default binary format when encoded. For projects that
   wish to interleave tokenized messages with plain text, using Base64 is a
   worthwhile tradeoff.

Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(str(result))

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors database files for changes and automatically reloads them when they
change. This is helpful for long-running tools that use detokenization.
C++
---
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created
with ``database.py create --type binary``). Read a binary format database from
a file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it.
If it is invalid, ``TokenDatabase::Create`` returns an empty database for
which ``ok()`` returns false. If the token database is included in the source
code, this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }
Base64 format
=============
The tokenizer encodes messages to a compact binary representation.
Applications may desire a textual representation of tokenized strings. This
makes it easy to use tokenized messages alongside plain text messages, but
comes at a small efficiency cost: encoded Base64 messages occupy about 4/3
(133%) as much memory as binary messages.

The Base64 format is comprised of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                          size_t size_bytes) {
     char base64_buffer[64];
     size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes), base64_buffer);

     TransmitLogMessage(base64_buffer, base64_size);
   }

Decoding
--------
Base64 decoding and detokenizing is supported in the Python detokenizer
through the ``detokenize_base64`` and related functions.

.. tip::

   The Python detokenization tools support recursive detokenization for
   prefixed Base64 text. Tokenized strings found in detokenized text are
   detokenized, so prefixed Base64 messages can be passed as ``%s`` arguments.

   For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could
   be passed as an argument to the printf-style string ``Nested message: %s``,
   which encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode
   the message as follows:

   ::

     "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions. The following sketch decodes a prefixed Base64 message back to its
binary form; ``ProcessBinaryTokenizedMessage`` is a hypothetical project
function, and the exact signature of ``PrefixedBase64Decode`` may differ.

.. code-block:: cpp

   void ProcessBase64Message(std::string_view base64_message) {
     // Sketch: decode the $-prefixed Base64 message back to binary form.
     // The returned size is 0 if the input is not a valid prefixed
     // Base64 message.
     std::byte binary_buffer[64];
     size_t binary_size = pw::tokenizer::PrefixedBase64Decode(
         base64_message, binary_buffer);

     if (binary_size > 0u) {
       ProcessBinaryTokenizedMessage(binary_buffer, binary_size);
     }
   }

Command line utilities
^^^^^^^^^^^^^^^^^^^^^^
``pw_tokenizer`` provides two standalone command line utilities for
detokenizing Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``detokenize_serial.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be
executed as runnable modules. For example:

.. code-block:: sh

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.detokenize_serial --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.
Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
  * Log contents shrunk by over 50%, even with Base64 encoding.

    * Significant size savings for encoded logs, even using the
      less-efficient Base64 encoding required for compatibility with the
      existing log system.
    * Freed valuable communication bandwidth.
    * Allowed storing many more logs in crash dumps.

  * Substantial flash savings.

    * Reduced the size of firmware images by up to 18%.

  * Simpler logging code.

    * Removed CPU-heavy ``snprintf`` calls.
    * Removed complex code for forwarding log arguments to a low-priority
      task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
  * In the project's logging macro, calls to the underlying logging function
    were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
    invocation.
  * The log level was passed as the payload argument to facilitate runtime
    log level control.
  * For this project, it was necessary to encode the log messages as text. In
    ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
    encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
    messages.
  * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If line numbers are desired in a tokenized
   string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
  * The token database was stored as a CSV file in the project's Git repo.
  * The token database was automatically updated as part of the build, and
    developers were expected to check in the database changes alongside their
    code changes.
  * A presubmit check verified that all strings added by a change were added
    to the token database.
  * The token database included logs and asserts for all firmware images in
    the project.
  * No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source database. If
   the database is in-source, make sure there is a simple script to resolve
   any merge conflicts. The script could either keep both sets of lines or
   discard local changes and regenerate the database.

Decoding tooling deployment
---------------------------
  * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

    * Product-specific Python command line tools, using
      ``pw_tokenizer.Detokenizer``.
    * Standalone script for decoding prefixed Base64 tokens in files or
      live output (e.g. from ``adb``), using ``detokenize.py``'s command line
      interface.

  * The C++ detokenizer library was deployed to two Android apps with a Java
    Native Interface (JNI) layer.

    * The binary token database was included as a raw resource in the APK.
    * In one app, the built-in token database could be overridden by copying
      a file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

  1. Tokenized strings will not be discovered by the token database tools.
  2. Tokenized strings may not be removed from the final binary.

Clang does **not** have this issue! Use Clang to avoid it.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then,
to remove the strings, compile two binaries: one metadata binary with all
tokenized strings and a second, final binary that removes the strings. The
strings could be removed by providing the appropriate linker flags or by
removing the ``used`` attribute from the tokenized string character array
declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be
verified by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That is
because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.
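For example, a header pattern like the following sketch can trigger
``-Wlto-type-mismatch``. ``PROJECT_REVISION_STRING`` is a hypothetical macro
that may expand to different strings in different translation units, so the
character array backing the tokenized string entry is defined with different
sizes.

.. code-block:: cpp

   // In a header that is included from multiple translation units.
   inline void LogRevision() {
     // If PROJECT_REVISION_STRING differs between translation units, the
     // tokenized string entry for this call has a different size in each,
     // which LTO may report as a type mismatch.
     PW_TOKENIZE_TO_GLOBAL_HANDLER("Revision: " PROJECT_REVISION_STRING);
   }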
Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized
string literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer ``%`` argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

   #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

   constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

   PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support
a small number of arguments.
Legacy tokenized string ELF format
==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries.
Strings in different domains were stored in different linker sections. The
Python script that parsed the ELF file would re-calculate the tokens.

In the current version of ``pw_tokenizer``, tokenized strings are stored in a
structured entry containing a token, domain, and length-delimited string. This
has several advantages over the legacy format:

* The Python script does not have to recalculate the token, so any hash
  algorithm may be used in the firmware.
* In C++, the tokenization hash no longer has a length limitation.
* Strings with null terminators in them are properly handled.
* Only one linker section is required in the linker script, instead of a
  separate section for each domain.

To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF
format.

Compatibility
=============
  * C11
  * C++11
  * Python 3

Dependencies
============
  * ``pw_varint`` module
  * ``pw_preprocessor`` module
  * ``pw_span`` module