.. _module-pw_tokenizer-proto:

------------------------------------
Tokenized fields in protocol buffers
------------------------------------
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)

``pw_tokenizer`` provides tools for working with protobuf fields that may
contain tokenized text.

Tokenized field protobuf option
===============================
``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

Decoding optionally tokenized strings
=====================================
The encoding used for an optionally tokenized field is not recorded in the
protobuf. Despite this, the text can reliably be decoded. This is accomplished
by attempting to decode the field as binary or Base64 tokenized data before
treating it like plain text.

The following diagram describes the decoding process for optionally tokenized
fields in detail.

.. mermaid::

   flowchart TD
     start([Received bytes]) --> binary

     binary[Decode as<br>binary tokenized] --> binary_ok
     binary_ok{Detokenizes<br>successfully?} -->|no| utf8
     binary_ok -->|yes| done_binary([Display decoded binary])

     utf8[Decode as UTF-8] --> utf8_ok
     utf8_ok{Valid UTF-8?} -->|no| base64_encode
     utf8_ok -->|yes| base64

     base64_encode[Encode as<br>tokenized Base64] --> display
     display([Display encoded Base64])

     base64[Decode as<br>Base64 tokenized] --> base64_ok

     base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
     base64_ok -->|yes| base64_results

     is_plain_text{Text is<br>printable?} -->|no| base64_encode
     is_plain_text -->|yes| plain_text

     base64_results([Display decoded Base64])
     plain_text([Display text])

Potential decoding problems
---------------------------
The decoding process for optionally tokenized fields will yield correct results
in almost every situation. In rare circumstances, it is possible for it to fail,
but these failures can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a plain-text string happens to decode as a binary tokenized message, the
incorrect message could be displayed. This is very unlikely to occur. While many
tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
that a device will happen to log one of these strings as plain text. The
overwhelming majority of these strings will be nonsense.

An implementation that wishes to guard against this extremely improbable
situation can do so by appending 0xFF (or another byte never valid in UTF-8) to
binary tokenized data that happens to be valid UTF-8 (or to all binary tokenized
messages, if desired). When decoding, if there is an extra 0xFF byte, it is
discarded.

Displaying undecoded binary as plain text instead of Base64
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
is displayed as tokenized Base64. This makes it easily recognizable as a
tokenized message and makes it simple to decode later from the text output (for
example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid
UTF-8 or ASCII. For example, 6.25% (1 in 16) of 4-byte sequences are composed
only of ASCII characters. When decoding with an out-of-date token database, it
is possible that some binary tokenized messages will be displayed as plain text
rather than tokenized Base64.

This situation is likely to occur, but only infrequently. Even if it does
happen, it is not a serious issue. A very small number of strings will be
displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
Updating the token database would resolve the issue, though the non-Base64 logs
would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in
`Accidentally interpreting plain text as tokenized binary`_. Appending
an invalid UTF-8 character prevents the undecoded binary message from being
interpreted as plain text.

Python library
==============
The ``pw_tokenizer.proto`` module defines functions that may be used to
detokenize protobuf objects in Python. The function
:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:

.. code-block:: python

   import pw_tokenizer.proto

   # some_database and SomeMessage are placeholders for a token database and
   # a generated protobuf class with a field annotated as tokenized.
   my_detokenizer = pw_tokenizer.Detokenizer(some_database)

   my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
   pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

   assert my_message.tokenized_field == b'The detokenized string! Cool!'

pw_tokenizer.proto
------------------
.. automodule:: pw_tokenizer.proto
   :members:
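
To make the decoding order for optionally tokenized fields concrete, the
following is a minimal, standalone sketch of that flow. It does not use the
``pw_tokenizer`` library: ``TOKEN_DATABASE``, ``detokenize_binary``,
``detokenize_base64``, ``encode_base64`` and ``decode_optionally_tokenized`` are
hypothetical names introduced for illustration, the toy database holds a single
made-up string, message arguments are ignored, and ``str.isprintable`` is only
an assumed stand-in for the "text is printable" check.

.. code-block:: python

   import base64
   import re
   import struct
   from typing import Optional, Tuple

   # Hypothetical toy database: 32-bit token -> string without arguments.
   TOKEN_DATABASE = {0x709FB689: "Hello from the toy database"}

   # "$" followed by Base64 characters, as in $ibafcA==.
   _BASE64_MESSAGE = re.compile(r"\$[A-Za-z0-9+/]+={0,2}")


   def detokenize_binary(data: bytes) -> Optional[str]:
       """Looks up the leading little-endian 32-bit token; None if unknown."""
       if len(data) < 4:
           return None
       (token,) = struct.unpack("<I", data[:4])
       return TOKEN_DATABASE.get(token)


   def encode_base64(data: bytes) -> str:
       """Formats undecoded binary as a "$"-prefixed Base64 message."""
       return "$" + base64.b64encode(data).decode("ascii")


   def detokenize_base64(text: str) -> Tuple[str, bool]:
       """Replaces known Base64 messages in text; reports whether any were."""
       replaced = False

       def substitute(match: re.Match) -> str:
           nonlocal replaced
           try:
               binary = base64.b64decode(match.group(0)[1:], validate=True)
           except ValueError:  # Not valid Base64; leave the text untouched.
               return match.group(0)
           result = detokenize_binary(binary)
           if result is None:
               return match.group(0)
           replaced = True
           return result

       return _BASE64_MESSAGE.sub(substitute, text), replaced


   def decode_optionally_tokenized(data: bytes) -> str:
       """Follows the order: binary, then UTF-8, then Base64, then plain text."""
       # 1. Try to detokenize the bytes as a binary tokenized message.
       result = detokenize_binary(data)
       if result is not None:
           return result

       # 2. Bytes that are not valid UTF-8 are shown as tokenized Base64.
       try:
           text = data.decode("utf-8")
       except UnicodeDecodeError:
           return encode_base64(data)

       # 3. Try to detokenize any Base64 tokenized messages in the text.
       decoded, replaced = detokenize_base64(text)
       if replaced:
           return decoded

       # 4. Printable text is shown as is; anything else becomes Base64.
       return text if text.isprintable() else encode_base64(data)


   for field in (bytes.fromhex("89b69f70"), b"$ibafcA==",
                 b"This is plain text", b"\xff\x00\x01\x02", b"a-D1"):
       print(field, "->", decode_optionally_tokenized(field))

In this sketch, ``b"a-D1"`` is returned as plain text because its token is not
in the toy database and its bytes are printable ASCII. This is the second
failure mode described under "Potential decoding problems".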