
.. _module-pw_tokenizer-proto:

------------------------------------
Tokenized fields in protocol buffers
------------------------------------
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)
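
The last three forms can all describe the same token. The following sketch uses
only the Python standard library, not ``pw_tokenizer`` itself, to show how the
example values above relate to one another. It assumes a message with no
arguments and the ``$`` prefix shown in the Base64 example.

.. code-block:: python

  import base64

  # Binary-encoded tokenized message: a 4-byte token with no arguments.
  binary_message = bytes.fromhex('89b69f70')

  # The binary form is the 32-bit token in little-endian byte order.
  token = int.from_bytes(binary_message, 'little')

  # The Base64 form is the binary message, Base64-encoded, with a "$" prefix.
  base64_message = '$' + base64.b64encode(binary_message).decode('ascii')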

``pw_tokenizer`` provides tools for working with protobuf fields that may
contain tokenized text.

Tokenized field protobuf option
===============================
``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

  message MessageWithOptionallyTokenizedField {
    bytes just_bytes = 1;
    bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
    string just_text = 3;
  }

Decoding optionally tokenized strings
=====================================
The encoding used for an optionally tokenized field is not recorded in the
protobuf. Despite this, the text can reliably be decoded. This is accomplished
by attempting to decode the field as binary or Base64 tokenized data before
treating it like plain text.

The following diagram describes the decoding process for optionally tokenized
fields in detail.

.. mermaid::

  flowchart TD
     start([Received bytes]) --> binary

     binary[Decode as<br>binary tokenized] --> binary_ok
     binary_ok{Detokenizes<br>successfully?} -->|no| utf8
     binary_ok -->|yes| done_binary([Display decoded binary])

     utf8[Decode as UTF-8] --> utf8_ok
     utf8_ok{Valid UTF-8?} -->|no| base64_encode
     utf8_ok -->|yes| base64

     base64_encode[Encode as<br>tokenized Base64] --> display
     display([Display encoded Base64])

     base64[Decode as<br>Base64 tokenized] --> base64_ok

     base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
     base64_ok -->|yes| base64_results

     is_plain_text{Text is<br>printable?} -->|no| base64_encode
     is_plain_text-->|yes| plain_text

     base64_results([Display decoded Base64])
     plain_text([Display text])
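
The flowchart can be approximated in a few lines of Python. This is a
simplified sketch rather than the ``pw_tokenizer`` implementation. The token
database is assumed to be a plain ``dict`` from 32-bit token to string, message
arguments are ignored, and all function names are hypothetical.

.. code-block:: python

  import base64
  import binascii
  import re
  from typing import Optional

  # Assumption: "$" followed by Base64 characters marks a tokenized message.
  _BASE64_MESSAGE = re.compile(r'\$([A-Za-z0-9+/]+={0,2})')


  def detokenize_binary(data: bytes, database: dict) -> Optional[str]:
      """Looks up the leading little-endian token; None if it is unknown."""
      if len(data) < 4:
          return None
      return database.get(int.from_bytes(data[:4], 'little'))


  def detokenize_base64(text: str, database: dict) -> tuple:
      """Replaces each known $-prefixed message; returns (text, count)."""
      count = 0

      def replace(match: re.Match) -> str:
          nonlocal count
          try:
              data = base64.b64decode(match.group(1), validate=True)
          except binascii.Error:
              return match.group(0)  # Not valid Base64; leave untouched.
          decoded = detokenize_binary(data, database)
          if decoded is None:
              return match.group(0)  # Unknown token; leave untouched.
          count += 1
          return decoded

      return _BASE64_MESSAGE.sub(replace, text), count


  def encode_base64(data: bytes) -> str:
      return '$' + base64.b64encode(data).decode('ascii')


  def decode_optionally_tokenized(data: bytes, database: dict) -> str:
      # 1. Try to decode as binary tokenized data.
      decoded = detokenize_binary(data, database)
      if decoded is not None:
          return decoded
      # 2. Data that is not valid UTF-8 is displayed as tokenized Base64.
      try:
          text = data.decode('utf-8')
      except UnicodeDecodeError:
          return encode_base64(data)
      # 3. Try to decode as Base64 tokenized text (fully or partially).
      detokenized, count = detokenize_base64(text, database)
      if count:
          return detokenized
      # 4. Printable text is shown as is; anything else becomes Base64.
      return text if text.isprintable() else encode_base64(data)

With a database that contains only the token ``0x709fb689``, both
``89 b6 9f 70`` and ``$ibafcA==`` decode to the same string, while
``This is plain text`` is returned unchanged.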

Potential decoding problems
---------------------------
The decoding process for optionally tokenized fields will yield correct results
in almost every situation. In rare circumstances, it is possible for it to fail,
but these failures can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a plain-text string happens to decode as a binary tokenized message, the
incorrect message could be displayed. This is very unlikely to occur. While many
tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
that a device will happen to log one of these strings as plain text. The
overwhelming majority of these strings will be nonsense.

An implementation that wishes to guard against this extremely improbable
situation can do so by appending 0xFF (or another byte never valid in UTF-8) to
binary tokenized data that happens to be valid UTF-8 (or to all binary tokenized
messages, if desired). When decoding, the extra 0xFF byte, if present, is
discarded.
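
As a minimal sketch of this mitigation, the helpers below assume the variant in
which the byte is appended to every binary tokenized message. The function
names are hypothetical and are not part of ``pw_tokenizer``. Because valid
UTF-8 text can never end in 0xFF, the decoder can use the trailing byte to tell
binary tokenized data apart from plain or Base64 text.

.. code-block:: python

  GUARD = b'\xff'  # 0xFF never appears in valid UTF-8.


  def add_guard(binary_message: bytes) -> bytes:
      """Marks a binary tokenized message so it can never be valid UTF-8."""
      return binary_message + GUARD


  def split_guard(field: bytes) -> tuple:
      """Returns (payload, is_binary_tokenized) for a received field."""
      if field.endswith(GUARD):
          return field[:-1], True  # Discard the extra byte before decoding.
      return field, False  # No guard byte: treat the field as text.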

Displaying undecoded binary as plain text instead of Base64
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
is displayed as tokenized Base64. This makes it easily recognizable as a
tokenized message and makes it simple to decode later from the text output (for
example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid
UTF-8 or ASCII. Of all 4-byte sequences, 6.25% (1 in 16, since half of all byte
values are ASCII) are composed only of ASCII characters. When decoding with an
out-of-date token database, it is possible that some binary tokenized messages
will be displayed as plain text rather than tokenized Base64.

This situation is likely to occur, but should be infrequent. Even if it does
happen, it is not a serious issue. A very small number of strings will be
displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
Updating the token database would resolve the issue, though the non-Base64 logs
would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in
`Accidentally interpreting plain text as tokenized binary`_. Appending
an invalid UTF-8 character prevents the undecoded binary message from being
interpreted as plain text.

Python library
==============
The ``pw_tokenizer.proto`` module defines functions that may be used to
detokenize protobuf objects in Python. The function
:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:

.. code-block:: python

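  import pw_tokenizer
  import pw_tokenizer.proto

  # some_database and SomeMessage are placeholders for a token database and a
  # generated protobuf class with a field annotated as optionally tokenized.
  my_detokenizer = pw_tokenizer.Detokenizer(some_database)

  my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
  pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

  assert my_message.tokenized_field == b'The detokenized string! Cool!'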

pw_tokenizer.proto
------------------
.. automodule:: pw_tokenizer.proto
  :members: