1# upb Design 2 3[TOC] 4 5upb is a protobuf kernel written in C. It is a fast and conformant 6implementation of protobuf, with a low-level C API that is designed to be 7wrapped in other languages. 8 9upb is not designed to be used by applications directly. The C API is very 10low-level, unsafe, and changes frequently. It is important that upb is able to 11make breaking API changes as necessary, to avoid taking on technical debt that 12would compromise upb's goals of small code size and fast performance. 13 14## Design goals 15 16Goals: 17 18- Full protobuf conformance 19- Small code size 20- Fast performance (without compromising code size) 21- Easy to wrap in language runtimes 22- Easy to adapt to different memory management schemes (refcounting, GC, etc) 23 24Non-Goals: 25 26- Stable API 27- Safe API 28- Ergonomic API for applications 29 30Parameters: 31 32- C99 33- 32 or 64-bit CPU (assumes 4 or 8 byte pointers) 34- Uses pointer tagging, but avoids other implementation-defined behavior 35- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc) 36- No global state, fully re-entrant 37 38## Arenas 39 40All memory management in upb uses arenas, using the type `upb_Arena`. Arenas are 41an alternative to `malloc()` and `free()` that significantly reduces the costs 42of memory allocation. 43 44Arenas obtain blocks of memory using some underlying allocator (likely 45`malloc()` and `free()`), and satisfy allocations using a simple bump allocator 46that walks through each block in linear order. Allocations cannot be freed 47individually: it is only possible to free the arena as a whole, which frees all 48of the underlying blocks. 49 50Here is an example of using the `upb_Arena` type: 51 52```c 53 upb_Arena* arena = upb_Arena_New(); 54 55 // Perform some allocations. 56 int* x = upb_Arena_Malloc(arena, sizeof(*x)); 57 int* y = upb_Arena_Malloc(arena, sizeof(*y)); 58 59 // We cannot free `x` and `y` separately, we can only free the arena 60 // as a whole. 61 upb_Arena_Free(arena); 62``` 63 64upb uses arenas for all memory management, and this fact is reflected in the API 65for all upb data structures. All upb functions that allocate take a `upb_Arena*` 66parameter and perform allocations using that arena rather than calling 67`malloc()` or `free()`. 68 69```c 70// upb API to create a message. 71UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table, 72 upb_Arena* arena); 73 74void MakeMessage(const upb_MiniTable* mini_table) { 75 upb_Arena* arena = upb_Arena_New(); 76 77 // This message is allocated on our arena. 78 upb_Message* msg = upb_Message_New(mini_table, arena); 79 80 // We can free the arena whenever we want, but we cannot free the 81 // message separately from the arena. 82 upb_Arena_Free(arena); 83 84 // msg is now deleted. 85} 86``` 87 88Arenas are a key part of upb's performance story. Parsing a large protobuf 89payload usually involves rapidly creating a series of messages, arrays (repeated 90fields), and maps. It is crucial for parsing performance that these allocations 91are as fast as possible. Equally important, freeing the tree of messages should 92be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg 93n)`. 94 95### Avoiding Dangling Pointers 96 97Objects allocated on an arena will frequently contain pointers to other 98arena-allocated objects. For example, a `upb_Message` will have pointers to 99sub-messages that are also arena-allocated. 100 101Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide 102automatic safety from dangling pointers. Instead, upb provides tools to help 103bridge between higher-level memory management schemes (GC, refcounting, RAII, 104borrow checkers) and arenas. 105 106If there is only one arena, dangling pointers within the arena are impossible, 107because all objects are freed at the same time. This is the simplest case. The 108user must still be careful not to keep dangling pointers that point at arena 109memory after it has been freed, but dangling pointers *between* the arena 110objects will be impossible. 111 112But what if there are multiple arenas? If we have a pointer from one arena to 113another, how do we ensure that this will not become a dangling pointer? 114 115To help with the multiple arena case, upb provides a primitive called "fuse". 116 117```c 118// Fuses the lifetimes of `a` and `b`. None of the blocks from `a` or `b` 119// will be freed until both arenas are freed. 120UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b); 121``` 122 123When two arenas are fused together, their lifetimes are irreversibly joined, 124such that none of the arena blocks in either arena will be freed until *both* 125arenas are freed with `upb_Arena_Free()`. This means that dangling pointers 126between the two arenas will no longer be possible. 127 128Fuse is useful when joining two messages from separate arenas (making one a 129sub-message of the other). Fuse is a relatively cheap operation, on the order of 130150ns, and is very nearly `O(1)` in the number of arenas being fused (the true 131complexity is the inverse Ackermann function, which grows extremely slowly). 132 133Each arena does consume some memory, so repeatedly creating and fusing an 134additional arena is not free, but the CPU cost of fusing two arenas together is 135modest. 136 137### Initial Block and Custom Allocators 138 139`upb_Arena` normally uses `malloc()` and `free()` to allocate and return its 140underlying blocks. But this default strategy can be customized to support the 141needs of a particular language. 142 143The lowest-level function for creating a `upb_Arena` is: 144 145```c 146// Creates an arena from the given initial block (if any -- n may be 0). 147// Additional blocks will be allocated from |alloc|. If |alloc| is NULL, 148// this is a fixed-size arena and cannot grow. 149UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc); 150``` 151 152The buffer `[mem, n]` will be used as an "initial block", which is used to 153satisfy allocations before calling any underlying allocation function. Note that 154the `upb_Arena` itself will be allocated from the initial block if possible, so 155the amount of memory available for allocation from the arena will be less than 156`n`. 157 158The `alloc` parameter specifies a custom memory allocation function which will 159be used once the initial block is exhausted. The user can pass `NULL` as the 160allocation function, in which case the initial block is the only memory 161available in the arena. This can allow upb to be used even in situations where 162there is no heap. 163 164It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating 165operations like `upb_Message_New()` should be checked for failure if there is 166any possibility that a fixed size arena is in use. 167 168## Schemas 169 170Nearly all operations in upb require that you have a schema. A protobuf schema 171is a data structure that contains all of the message, field, enum, etc. 172definitions that are specified in a `.proto` file. To create, parse, serialize, 173or access a message you must have a schema. For this reason, loading a schema is 174generally the first thing you must do when you use upb. [^0] 175 176[^0]: This requirement comes from the protobuf wire format itself, which is a 177 deep insight about the nature of protobuf (or at least the existing wire 178 format). Unlike JSON, protobuf cannot be parsed or manipulated in a 179 schema-less way. This is because the binary wire format does not 180 distinguish between strings and sub-messages, so a generic parser that is 181 oblivious to the schema is not possible. If a future version of the wire 182 format did distinguish between these, it could be possible to have a 183 schema-agnostic data representation, parser, and serializer. 184 185upb has two main data structures that represent a protobuf schema: 186 187* **MiniTables** are a compact, stripped down version of the schema that 188 contains only the information necessary for parsing and serializing the 189 binary wire format. 190* **Reflection** contains basically all of the data from a `.proto` file, 191 including the original names of all messages/fields/etc., and all options. 192 193The table below summarizes the main differences between these two: 194 195| | MiniTables | Reflection | 196| ------------------- | ------------------------- | -------------------------- | 197| Contains | Field numbers and types | All data in `.proto` file, | 198: : only : including names of : 199: : : everything : 200| Used to parse | binary format | JSON / TextFormat | 201| Wire representation | MiniDescriptor | Descriptor | 202| Type names | `upb_MiniTable`, | `upb_MessageDef`, | 203: : `upb_MiniTableField`, ... : `upb_FieldDef`, ... : 204| Registry | `upb_ExtensionRegistry` | `upb_DefPool` | 205: : (for extensions) : : 206 207MiniTables are useful if you only need the binary wire format, because they are 208much lighter weight than full reflection. 209 210Reflection is useful if you need to parse JSON or TextFormat, or you need access 211to options that were specified in the `proto` file. Note that reflection also 212includes MiniTables, so if you have reflection, you also have MiniTables 213available. 214 215### MiniTables 216 217MiniTables are represented by a set of data structures with names like 218`upb_MiniTable` (representing a message), `upb_MiniTableField`, 219`upb_MiniTableFile`, etc. Whenever you see one of these types in a function 220signature, you know that this particular operation requires a MiniTable. For 221example: 222 223``` 224// Parses the wire format data in the given buffer `[buf, size]` and writes it 225// to the message `msg`, which has the type `mt`. 226UPB_API upb_DecodeStatus upb_Decode(const char* buf, size_t size, 227 upb_Message* msg, const upb_MiniTable* mt, 228 const upb_ExtensionRegistry* extreg, 229 int options, upb_Arena* arena); 230``` 231 232The subset of upb that requires only MiniTables can be thought of as "upb lite," 233because both the code size and the runtime memory overhead will be less than 234"upb full" (the parts that use reflection). 235 236#### Loading 237 238There are three main ways of loading a MiniTable: 239 2401. **From C generated code:** The upb code generator can emit `.upb.c` files that 241 contain the MiniTables as global constant variables. When the main program 242 links against these, the MiniTable will be placed into `.rodata` (or 243 `.data.rel.ro`) in the binary. The MiniTable can then be obtained from a 244 generated function. In Blaze/Bazel these files can be generated and linked 245 using the `upb_proto_library()` rule. 2462. **From MiniDescriptors:** The user can build MiniDescriptors into MiniTables 247 at runtime. MiniDescriptors are a compact upb-specific wire format designed 248 specially for this purpose. The user can call `upb_MiniTable_Build()` at 249 runtime to convert MiniDescriptors to MiniTables. 2503. **From reflection:** If you have already built reflection data structures 251 for your type, then you can obtain the `upb_MiniTable` corresponding to a 252 `upb_MessageDef` using `upb_MessageDef_MiniTable()`. 253 254For languages that are already using reflection, (3) is an obvious choice. 255 256For languages that are avoiding reflection, here is a general guideline for 257choosing between (1) and (2): if the language being wrapped participates in the 258standard binary linking model on a given platform (in particular, if it is 259generally linked using `ld`), then it is better to use (1), which is also known 260as "static loading". 261 262Static loading of MiniTables has the benefit of requiring no runtime 263initialization[^2], leading to faster startup. Static loading of MiniTables also 264facilitates cross-language sharing of proto messages, because sharing generally 265requires that both languages are using exactly the same MiniTables. 266 267The main downside of static loading is that it requires the user to generate one 268`.upb.c` file per `.proto` and link against the transitive closure of `.upb.c` 269files. Blaze and Bazel make this reasonably easy, but for other build systems it 270can be more of a challenge. 271 272[^2]: aside from possible pointer relocations performed by the ELF/Mach-O loader 273 if the library or binary is position-independent 274 275Loading from MiniDescriptors, as in option (2), has the advantage that it does 276not require per-message linking of C code. For many language toolchains, 277generating and linking some custom C code for every protobuf file or message 278type would be a burdensome requirement. MiniDescriptors are a convenient way of 279loading MiniTables without needing to cross the FFI boundary outside the core 280runtime. 281 282A common pattern when using dynamic loading is to embed strings containing 283MiniDescriptors directly into generated code. For example, the generated code in 284Dart for a message with only primitive fields currently looks something like: 285 286```dart 287 const desc = r'$(+),*-#$%&! /10'; 288 _accessor = $pb.instance.registry.newMessageAccessor(desc); 289``` 290 291The implementation of `newMessageAccessor()` is mainly just a wrapper around 292`upb_MiniTable_Build()`, which builds a MiniTable from a MiniDescriptor. In the 293code generator, the MiniDescriptor can be obtained from the 294`upb_MessageDef_MiniDescriptorEncode()` API; users should never need to encode a 295MiniDescriptor manually. 296 297#### Linking 298 299When building MiniTables dynamically, it is the user's responsibility to link 300each message to the to the appropriate sub-messages and or enums. Each message 301must have its message and closed enum fields linked using 302`upb_MiniTable_SetSubMessage()` and `upb_MiniTable_SetSubEnum()`, respectively. 303 304A higher-level function that links all fields at the same time is also 305available, as `upb_MiniTable_Link()`. This function pairs well with 306`upb_MiniTable_GetSubList()` which can be used in a code generator to get a list 307of all the messages and enums which must be passed to `upb_MiniTable_Link()`. 308 309A common pattern is to embed the `link()` calls directly into the generated 310code. For example, here is an example from Dart of building a MiniTable that 311contains sub-messages and enums: 312 313```dart 314 const desc = r'$3334'; 315 _accessor = $pb.instance.registry.newMessageAccessor(desc); 316 _accessor!.link( 317 [ 318 M2.$_accessor, 319 M3.$_accessor, 320 M4.$_accessor, 321 ], 322 [ 323 E.$_accessor, 324 ], 325 ); 326``` 327 328In this case, `upb_MiniTable_GetSubList()` was used in the code generator to 329discover the 3 sub-message fields and 1 sub-enum field that require linking. At 330runtime, these lists of MiniTables are passed to the `link()` function, which 331will internally call `upb_MiniTable_Link()`. 332 333Note that in some cases, the application may choose to delay or even skip the 334registration of sub-message types, as part of a tree shaking strategy. 335 336When using static MiniTables, a manual link step is not necessary, as linking is 337performed automatically by `ld`. 338 339#### Enums 340 341MiniTables primarily carry data about messages, fields, and extensions. However 342for closed enums, we must also have a `upb_MiniTableEnum` structure that stores 343the set of all numbers that are defined in the enum. This is because closed 344enums have the unfortunate behavior of putting unknown enum values into the 345unknown field set. 346 347Over time, closed enums will hopefully be phased out via editions, and the 348relevance and overhead of `upb_MiniTableEnum` will shrink and eventually 349disappear. 350 351### Reflection 352 353Reflection uses types like `upb_MessageDef` and `upb_FieldDef` to represent the 354full contents of a `.proto` file at runtime. These types are upb's direct 355equivalents of `google::protobuf::Descriptor`, `google::protobuf::FieldDescriptor`, etc. [^1] 356 357[^1]: upb consistently uses `Def` where C++ would use `Descriptor` in type 358 names. This introduces divergence with C++; the rationale was to conserve 359 horizontal line length, as `Def` is less than 1/3 the length of 360 `Descriptor`. This is more relevant in C, where the type name is repeated 361 in every function, eg. `upb_FieldDef_Name()` vs. 362 `upb_FieldDescriptor_Name()`. 363 364Whenever you see one of these types in a function signature, you know that the 365given operation requires reflection. For example: 366 367```c 368// Parses JSON format into a message object, using reflection. 369UPB_API bool upb_JsonDecode(const char* buf, size_t size, upb_Message* msg, 370 const upb_MessageDef* m, const upb_DefPool* symtab, 371 int options, upb_Arena* arena, upb_Status* status); 372``` 373 374The part of upb that requires reflection can be thought of as "upb full." These 375parts of the library cannot be used if a given application has only loaded 376MiniTables. There is no way to convert a MiniTable into reflection. 377 378The `upb_DefPool` type is the top-level container that builds and owns some set 379of defs. This type is a close analogue of `google::protobuf::DescriptorPool` in C++. The 380user must always ensure that the `upb_DefPool` outlives any def objects that it 381owns. 382 383#### Loading 384 385As with MiniTable loading, we have multiple options for how to load full 386reflection: 387 3881. **From C generated code**: The upb code generator can create `foo.upbdefs.c` 389 files that embed the descriptors and exports generated C functions for 390 adding them to a user-provided `upb_DefPool`. 3912. **From descriptors**: The user can make manual calls to 392 `upb_DefPool_AddFile()`, using descriptors obtained at runtime. Defs for 393 individual messages can then be obtained using 394 `upb_DefPool_FindMessageByName()`. 395 396Unlike MiniTables, loading from generated code requires runtime initialization, 397as reflection data structures like `upb_MessageDef` are not capable of being 398emitted directly into `.rodata` like `upb_MiniTable` is. Instead, the generated 399code embeds serialized descriptor protos into `.rodata` which are then built 400into heap objects at runtime. 401 402From this you might conclude that option (1) is nothing but a convenience 403wrapper around option (2), but that is not quite correct either. Option (1) 404*does* link against the static `.upb.c` structures for the MiniTables, whereas 405option (2) will build the MiniTables from scratch on the heap. So option (1) 406will use marginally less CPU and RAM when the descriptors are loaded into a 407`upb_DefPool`. More importantly, the resulting descriptors will be capable of 408reflecting over any messages built from the generated `.upb.c` MiniTables, 409whereas descriptors built using option (2) will have distinct MiniTables that 410cannot reflect over messages that use the generated MiniTables. 411 412A common pattern for dynamic languages like PHP, Ruby, or Python, is to use 413option (2) with descriptors that are embedded into the generated code. For 414example, the generated code in Python currently looks something like: 415 416```python 417from google.protobuf import descriptor_pool as _descriptor_pool 418from google.protobuf.internal import builder as _builder 419 420_desc = b'\n\x1aprotoc_explorer/main.proto\x12\x03pkg' 421 422DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(_desc) 423_globals = globals() 424_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals) 425_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.protoc_explorer.main_pb2', _globals) 426``` 427 428The `AddSerializedFile()` API above is mainly just a thin wrapper around 429`upb_DefPool_AddFile()`. 430