1 2<!--- 3This document contains embedded graphviz diagrams inside ```dot blocks. 4 5To convert it to rendered form using render.py: 6 $ ./render.py wrapping-upb.in.md 7 8You can also live-preview this document with all diagrams using Markdown Preview Enhanced 9in Visual Studio Code: 10 https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced 11---> 12 13# Building a protobuf library on upb 14 15This is a guide for creating a new protobuf implementation based on upb. It 16starts from the beginning and walks you through the process, highlighting 17some important design choices you will need to make. 18 19## Overview 20 21A protobuf implementation consists of two main pieces: 22 231. a code generator, run at compile time, to turn `.proto` files into source 24 files in your language (we will call this "zlang", assuming an extension of ".z"). 252. a runtime component, which implements the wire format and provides the data 26 structures for representing protobuf data and metadata. 27 28<br/> 29 30```dot {align="center"} 31digraph { 32 rankdir=LR; 33 newrank=true; 34 node [style="rounded,filled" shape=box] 35 "foo.proto" -> protoc; 36 "foo.proto" [shape=folder]; 37 protoc [fillcolor=lightgrey]; 38 protoc -> "protoc-gen-zlang"; 39 "protoc-gen-zlang" -> "foo.z"; 40 "protoc-gen-zlang" [fillcolor=palegreen3]; 41 "foo.z" [shape=folder]; 42 labelloc="b"; 43 label="Compile Time"; 44} 45``` 46 47<br/> 48 49```dot {align="center"} 50digraph { 51 newrank=true; 52 node [style="rounded,filled" shape=box fillcolor=lightgrey] 53 "foo.z" -> "zlang/upb glue (FFI)"; 54 "zlang/upb glue (FFI)" -> "upb (C)"; 55 "zlang/upb glue (FFI)" [fillcolor=palegreen3]; 56 labelloc="b"; 57 label="Runtime"; 58} 59``` 60 61The parts in green are what you will need to implement. 62 63Note that your code generator (`protoc-gen-zlang`) does *not* need to generate 64any C code (eg. `foo.c`). While upb itself is written in C, upb's parsers and 65serializers are fully table-driven, which means there is never any need or even 66benefit to generating C code for each proto. upb is capable of full-speed 67parsing even when schema data is loaded at runtime from strings embedded into 68`foo.z`. This is a key benefit of upb compared with C++ protos, which have 69traditionally relied on generated parsers in `foo.pb.cc` files to achieve full 70parsing speed, and suffered a ~10x speed penalty in the parser when the schema 71data was loaded at runtime. 72 73## Prerequisites 74 75There are a few things that the language runtime must provide in order to wrap 76upb. 77 781. **FFI**: To wrap upb, your language must be able to call into a C API 79 through a Foreign Function Interface (FFI). Most languages support FFI in 80 some form, either through "native extensions" (in which you write some C 81 code to implement new methods in your language) or through a direct FFI (in 82 which you can call into regular C functions directly from your language 83 using a special library). 842. **Finalizers, Destructors, or Cleaners**: The runtime must provide 85 finalizers or destructors of some sort. There must be a way of triggering a 86 call to a C function when the language garbage collects or otherwise 87 destroys an object. We don't care much whether it is a finalizer, a 88 destructor, or a cleaner, as long as it gets called eventually when the 89 object is destroyed. upb allocates memory in C space, and a finalizer is our 90 only way of making sure that memory is freed and does not leak. 913. **HashMap with weak values**: (optional) This is not a strong requirement, 92 but it is sometimes helpful to have a global hashmap with weak values to act 93 as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not 94 the keys). There is some question about whether we want to continue to use 95 this pattern going forward. 96 97## Reflection vs. MiniTables 98 99The first key design decision you will need to make is whether your generated 100code will access message data via reflection or minitables. Generally more 101dynamic languages will want to use reflection and more static languages will 102want to use minitables. 103 104### Reflection 105 106Reflection-based data access makes the most sense in highly dynamic language 107interpreters, where method dispatch is generally resolved via strings and hash 108table lookups. 109 110In such languages, you can often implement a special method like `__getattr__` 111(Python) or `method_missing` (Ruby) that receives the method name as a string. 112Using upb's reflection, you can look up a field name using the method name, 113thereby using a hash table belonging to upb instead of one provided by the 114language. 115 116```python 117class FooMessage: 118 # Written in Python for illustration, but in practice we will want to 119 # implement this in C for speed. 120 def __getattr__(self, name): 121 field = FooMessage.descriptor.fields_by_name[name] 122 return field.get_value(self) 123``` 124 125Using this design, we only need to attach a single `__getattr__` method to each 126message class, instead of defining a getter/setter for each field. In this way 127we can avoid duplicating hash tables between upb and the language interpreter, 128reducing memory usage. 129 130Reflection-based access requires loading full reflection at runtime. Your 131generated code will need to embed serialized descriptors (ie. a serialized 132message of `descriptor.proto`), which has some amount of size overhead and 133exposes all message/field names to the binary. It also forces a hash table 134lookup in the critical path of field access. If method calls in your language 135already have this overhead, then this is no added burden, but for statically 136dispatched languages it would cause extra overhead. 137 138If we take this path to its logical conclusion, all class creation can be 139performed fully dynamically, using only a binary descriptor as input. The 140"generated code" becomes little more than an embedded descriptor plus a 141library call to load it. Python has recently gone down this path. Generated 142code now looks something like this: 143 144```python 145# main_pb2.py 146from google3.net.proto2.python.internal import builder as _builder 147from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool 148 149DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>") 150_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals()) 151_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals()) 152``` 153 154This is all the runtime needs to create all of the classes for messages defined 155in that serialized descriptor. This code has no pretense of readability, but 156a separate `.pyi` stub file provides a fully expanded and readable list of all 157methods a user can expect to be available: 158 159```python 160# main_pb2.pyi 161from google3.net.proto2.python.public import descriptor as _descriptor 162from google3.net.proto2.python.public import message as _message 163from typing import ClassVar as _ClassVar, Optional as _Optional 164 165DESCRIPTOR: _descriptor.FileDescriptor 166 167class MyMessage(_message.Message): 168 __slots__ = ["my_field"] 169 MY_FIELD_FIELD_NUMBER: _ClassVar[int] 170 my_field: str 171 def __init__(self, my_field: _Optional[str] = ...) -> None: ... 172``` 173 174To use reflection-based access: 175 1761. Load and access descriptor data using the interfaces in upb/def.h. 1772. Access message data using the interfaces in upb/reflection.h. 178 179### MiniTables 180 181MiniTables are a "lite" schema representation that are much smaller than 182reflection. MiniTables omit names, options, and almost everything else from the 183`.proto` file, retaining only enough information to parse and serialize binary 184format. 185 186MiniTables can be loaded into upb through *MiniDescriptors*. MiniDescriptors are 187a byte-oriented format that can be embedded into your generated code and passed 188to upb to construct MiniTables. MiniDescriptors only use printable characters, 189and therefore do not require escaping when embedding them into generated code 190strings. Overall the size savings of MiniDescriptors are ~60x compared with 191regular descriptors. 192 193MiniTables and MiniDescriptors are a natural choice for compiled languages that 194resolve method calls at compile time. For languages that are sometimes compiled 195and sometimes interpreted, there might not be an obvious choice. When a method 196call is statically bound, we want to remove as much overhead as possible, 197especially from accessors. In the extreme case, we can use unsafe APIs to read 198raw memory at a known offset: 199 200```java 201// Example of a maximally-optimized generated accessor. 202class FooMessage { 203 public long getBarField() { 204 // Using Unsafe should give us performance that is comparable to a 205 // native member access. 206 // 207 // The constant "24" is obtained from upb at compile time. 208 sun.misc.Unsafe.getLong(this.ptr, 24); 209 } 210} 211``` 212 213This design is very low-level, and tightly couples the generated code to one 214specific version of the schema and compiler. A slower but safer version would 215look up a field by field number: 216 217```java 218// Example of a more loosely-coupled accessor. 219class FooMessage { 220 public long getBarField() { 221 // The constant "2" is the field number. Internally this will look 222 // up the number "2" in the MiniTable and use that to read the value 223 // from the message. 224 upb.glue.getLong(this.ptr, 2); 225 } 226} 227``` 228 229One downside of MiniTables is that they cannot support parsing or serializing 230to JSON or TextFormat, because they do not know the field names. It should be 231possible to generate reflection data "on the side", into separate generated 232code files, so that reflection is only pulled in if it is being used. However 233APIs to do this do not exist yet. 234 235To use MiniTable-based access: 236 2371. Load and access MiniDescriptors data using the interfaces in upb/mini_table.h. 2382. Access message data using the interfaces in upb/msg_accessors.h. 239 240## Memory Management 241 242One of the core design challenges when wrapping upb is memory management. Every 243language runtime will have some memory management system, whether it is 244garbage collection, reference counting, manual memory management, or some hybrid 245of these. upb is written in C and uses arenas for memory management, but upb is 246designed to integrate with a wide variety of memory management schemes, and it 247provides a number of tools for making this integration as smooth as possible. 248 249### Arenas 250 251upb defines data structures in C to represent messages, arrays (repeated 252fields), and maps. A protobuf message is a hierarchical tree of these objects. 253For example, a relatively simple protobuf tree might look something like this: 254 255```dot {align="center"} 256digraph G { 257 rankdir=LR; 258 newrank=true; 259 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] 260 upb_msg -> upb_msg2; 261 upb_msg -> upb_array; 262 upb_msg [label="upb Message" fillcolor=1] 263 upb_msg2 [label="upb Message"]; 264 upb_array [label="upb Array"] 265} 266``` 267 268All upb objects are allocated from an arena. An arena lets you allocate objects 269individually, but you cannot free individual objects; you can only free the arena 270as a whole. When the arena is freed, all of the individual objects allocated 271from that arena are freed together. 272 273```dot {align="center"} 274digraph G { 275 rankdir=LR; 276 newrank=true; 277 subgraph cluster_0 { 278 label = "upb Arena" 279 graph[style="rounded,filled" fillcolor=gray] 280 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] 281 upb_msg -> upb_array; 282 upb_msg -> upb_msg2; 283 upb_msg [label="upb Message" fillcolor=1] 284 upb_msg2 [label="upb Message"]; 285 upb_array [label="upb Array"]; 286 } 287} 288``` 289 290In simple cases, the entire tree of objects will all live in a single arena. 291This has the nice property that there cannot be any dangling pointers between 292objects, since all objects are freed at the same time. 293 294However upb allows you to create links between any two objects, whether or 295not they are in the same arena. The library does not know or care what arenas 296the objects are in when you create links between them. 297 298```dot {align="center"} 299digraph G { 300 rankdir=LR; 301 newrank=true; 302 subgraph cluster_0 { 303 label = "upb Arena 1" 304 graph[style="rounded,filled" fillcolor=gray] 305 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] 306 upb_msg -> upb_array; 307 upb_msg -> upb_msg2; 308 upb_msg [label="upb Message 1" fillcolor=1] 309 upb_msg2 [label="upb Message 2"]; 310 upb_array [label="upb Array"]; 311 } 312 subgraph cluster_1 { 313 label = "upb Arena 2" 314 graph[style="rounded,filled" fillcolor=gray] 315 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1] 316 upb_msg3; 317 } 318 upb_msg2 -> upb_msg3; 319 upb_msg3 [label="upb Message 3"]; 320} 321``` 322 323When objects are on separate arenas, it is the user's responsibility to ensure 324that there are no dangling pointers. In the example above, this means Arena 2 325must outlive Message 1 and Message 2. 326 327### Integrating GC with upb 328 329In languages with automatic memory management, the goal is to handle all of the 330arenas behind the scenes, so that the user does not have to manage them manually 331or even know that they exist. 332 333We can achieve this goal if we set up the object graph in a particular way. The 334general strategy is to create wrapper objects around all of the C objects, 335including the arena. Our key goal is to make sure the arena wrapper is not 336GC'd until all of the C objects in that arena have become unreachable. 337 338For this example, we will assume we are wrapping upb in Python: 339 340```dot {align="center"} 341digraph G { 342 rankdir=LR; 343 newrank=true; 344 compound=true; 345 346 subgraph cluster_1 { 347 label = "upb Arena" 348 graph[style="rounded,filled" fillcolor=gray] 349 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] 350 upb_msg -> upb_array [style=dashed]; 351 upb_msg -> upb_msg2 [style=dashed]; 352 upb_msg [label="upb Message" fillcolor=1] 353 upb_msg2 [label="upb Message"]; 354 upb_array [label="upb Array"] 355 dummy [style=invis] 356 } 357 subgraph cluster_python { 358 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2] 359 peripheries=0 360 py_upb_msg [label="Python Message"]; 361 py_upb_msg2 [label="Python Message"]; 362 py_upb_arena [label="Python Arena"]; 363 } 364 py_upb_msg -> upb_msg [style=dashed]; 365 py_upb_msg2->upb_msg2 [style=dashed]; 366 py_upb_msg2 -> py_upb_arena [color=springgreen4]; 367 py_upb_msg -> py_upb_arena [color=springgreen4]; 368 py_upb_arena -> dummy [lhead=cluster_1, color=red]; 369 { 370 rank=same; 371 upb_msg; 372 py_upb_msg; 373 } 374 { 375 rank=same; 376 upb_array; 377 upb_msg2; 378 py_upb_msg2; 379 } 380 { rank=same; 381 dummy; 382 py_upb_arena; 383 } 384 dummy->upb_array [style=invis]; 385 dummy->upb_msg2 [style=invis]; 386 387 subgraph cluster_01 { 388 node [shape=plaintext] 389 peripheries=0 390 key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> 391 <tr><td align="right" port="i1">raw ptr</td></tr> 392 <tr><td align="right" port="i2">unique ptr</td></tr> 393 <tr><td align="right" port="i3">shared (GC) ptr</td></tr> 394 </table>>] 395 key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> 396 <tr><td port="i1"> </td></tr> 397 <tr><td port="i2"> </td></tr> 398 <tr><td port="i3"> </td></tr> 399 </table>>] 400 key:i1:e -> key2:i1:w [style=dashed] 401 key:i2:e -> key2:i2:w [color=red] 402 key:i3:e -> key2:i3:w [color=springgreen4] 403 } 404 key2:i1:w -> upb_msg [style=invis]; 405 { 406 rank=same; 407 key; 408 upb_msg; 409 } 410} 411``` 412 413In this example we have three different kinds of pointers: 414 415* **raw ptr**: This is a pointer that carries no ownership. 416* **unique ptr**: This is a pointer has *unique ownership* of the target. The owner 417 will free the target in its destructor (or finalizer, or cleaner). There can 418 only be a single unique pointer to a given object. 419* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the 420 target. Many objects can point to the target, and the target will be deleted 421 only when all such references are gone. In a runtime with automatic memory 422 management (GC), this is a reference that participates in GC. In Python such 423 references use reference counting, but in other VMs they may use mark and 424 sweep or some other form of GC instead. 425 426The Python Message wrappers have only raw pointers to the underlying message, 427but they contain a shared pointer to the arena that will ensure that the raw 428pointer remains valid. Only when all message wrapper objects are destroyed 429will the Python Arena become unreachable, and the upb arena ultimately freed. 430 431### Links between arenas with "Fuse" 432 433The design given above works well for objects that live in a single arena. But 434what if a user wants to create a link between two objects in different arenas? 435 436TODO 437 438## UTF-8 vs. UTF-16 439 440TODO 441 442## Object Cache 443 444TODO 445