• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# upb Design
2
3[TOC]
4
5upb is a protobuf kernel written in C. It is a fast and conformant
6implementation of protobuf, with a low-level C API that is designed to be
7wrapped in other languages.
8
9upb is not designed to be used by applications directly. The C API is very
10low-level, unsafe, and changes frequently. It is important that upb is able to
11make breaking API changes as necessary, to avoid taking on technical debt that
12would compromise upb's goals of small code size and fast performance.
13
14## Design goals
15
16Goals:
17
18-   Full protobuf conformance
19-   Small code size
20-   Fast performance (without compromising code size)
21-   Easy to wrap in language runtimes
22-   Easy to adapt to different memory management schemes (refcounting, GC, etc)
23
24Non-Goals:
25
26-   Stable API
27-   Safe API
28-   Ergonomic API for applications
29
30Parameters:
31
32-   C99
33-   32 or 64-bit CPU (assumes 4 or 8 byte pointers)
34-   Uses pointer tagging, but avoids other implementation-defined behavior
35-   Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
36-   No global state, fully re-entrant
37
38## Arenas
39
40All memory management in upb uses arenas, using the type `upb_Arena`. Arenas are
41an alternative to `malloc()` and `free()` that significantly reduces the costs
42of memory allocation.
43
44Arenas obtain blocks of memory using some underlying allocator (likely
45`malloc()` and `free()`), and satisfy allocations using a simple bump allocator
46that walks through each block in linear order. Allocations cannot be freed
47individually: it is only possible to free the arena as a whole, which frees all
48of the underlying blocks.
49
50Here is an example of using the `upb_Arena` type:
51
52```c
53  upb_Arena* arena = upb_Arena_New();
54
55  // Perform some allocations.
56  int* x = upb_Arena_Malloc(arena, sizeof(*x));
57  int* y = upb_Arena_Malloc(arena, sizeof(*y));
58
59  // We cannot free `x` and `y` separately, we can only free the arena
60  // as a whole.
61  upb_Arena_Free(arena);
62```
63
64upb uses arenas for all memory management, and this fact is reflected in the API
65for all upb data structures. All upb functions that allocate take a `upb_Arena*`
66parameter and perform allocations using that arena rather than calling
67`malloc()` or `free()`.
68
69```c
70// upb API to create a message.
71UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
72                                     upb_Arena* arena);
73
74void MakeMessage(const upb_MiniTable* mini_table) {
75  upb_Arena* arena = upb_Arena_New();
76
77  // This message is allocated on our arena.
78  upb_Message* msg = upb_Message_New(mini_table, arena);
79
80  // We can free the arena whenever we want, but we cannot free the
81  // message separately from the arena.
82  upb_Arena_Free(arena);
83
84  // msg is now deleted.
85}
86```
87
88Arenas are a key part of upb's performance story. Parsing a large protobuf
89payload usually involves rapidly creating a series of messages, arrays (repeated
90fields), and maps. It is crucial for parsing performance that these allocations
91are as fast as possible. Equally important, freeing the tree of messages should
92be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg
93n)`.
94
95### Avoiding Dangling Pointers
96
97Objects allocated on an arena will frequently contain pointers to other
98arena-allocated objects. For example, a `upb_Message` will have pointers to
99sub-messages that are also arena-allocated.
100
101Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
102automatic safety from dangling pointers. Instead, upb provides tools to help
103bridge between higher-level memory management schemes (GC, refcounting, RAII,
104borrow checkers) and arenas.
105
106If there is only one arena, dangling pointers within the arena are impossible,
107because all objects are freed at the same time. This is the simplest case. The
108user must still be careful not to keep dangling pointers that point at arena
109memory after it has been freed, but dangling pointers *between* the arena
110objects will be impossible.
111
112But what if there are multiple arenas? If we have a pointer from one arena to
113another, how do we ensure that this will not become a dangling pointer?
114
115To help with the multiple arena case, upb provides a primitive called "fuse".
116
117```c
118// Fuses the lifetimes of `a` and `b`.  None of the blocks from `a` or `b`
119// will be freed until both arenas are freed.
120UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
121```
122
123When two arenas are fused together, their lifetimes are irreversibly joined,
124such that none of the arena blocks in either arena will be freed until *both*
125arenas are freed with `upb_Arena_Free()`. This means that dangling pointers
126between the two arenas will no longer be possible.
127
128Fuse is useful when joining two messages from separate arenas (making one a
129sub-message of the other). Fuse is a relatively cheap operation, on the order of
130150ns, and is very nearly `O(1)` in the number of arenas being fused (the true
131complexity is the inverse Ackermann function, which grows extremely slowly).
132
133Each arena does consume some memory, so repeatedly creating and fusing an
134additional arena is not free, but the CPU cost of fusing two arenas together is
135modest.
136
137### Initial Block and Custom Allocators
138
139`upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
140underlying blocks. But this default strategy can be customized to support the
141needs of a particular language.
142
143The lowest-level function for creating a `upb_Arena` is:
144
145```c
146// Creates an arena from the given initial block (if any -- n may be 0).
147// Additional blocks will be allocated from |alloc|.  If |alloc| is NULL,
148// this is a fixed-size arena and cannot grow.
149UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
150```
151
152The buffer `[mem, n]` will be used as an "initial block", which is used to
153satisfy allocations before calling any underlying allocation function. Note that
154the `upb_Arena` itself will be allocated from the initial block if possible, so
155the amount of memory available for allocation from the arena will be less than
156`n`.
157
158The `alloc` parameter specifies a custom memory allocation function which will
159be used once the initial block is exhausted. The user can pass `NULL` as the
160allocation function, in which case the initial block is the only memory
161available in the arena. This can allow upb to be used even in situations where
162there is no heap.
163
164It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
165operations like `upb_Message_New()` should be checked for failure if there is
166any possibility that a fixed size arena is in use.
167
168## Schemas
169
170Nearly all operations in upb require that you have a schema. A protobuf schema
171is a data structure that contains all of the message, field, enum, etc.
172definitions that are specified in a `.proto` file. To create, parse, serialize,
173or access a message you must have a schema. For this reason, loading a schema is
174generally the first thing you must do when you use upb. [^0]
175
176[^0]: This requirement comes from the protobuf wire format itself, which is a
177    deep insight about the nature of protobuf (or at least the existing wire
178    format). Unlike JSON, protobuf cannot be parsed or manipulated in a
179    schema-less way. This is because the binary wire format does not
180    distinguish between strings and sub-messages, so a generic parser that is
181    oblivious to the schema is not possible. If a future version of the wire
182    format did distinguish between these, it could be possible to have a
183    schema-agnostic data representation, parser, and serializer.
184
185upb has two main data structures that represent a protobuf schema:
186
187*   **MiniTables** are a compact, stripped down version of the schema that
188    contains only the information necessary for parsing and serializing the
189    binary wire format.
190*   **Reflection** contains basically all of the data from a `.proto` file,
191    including the original names of all messages/fields/etc., and all options.
192
193The table below summarizes the main differences between these two:
194
195|                     | MiniTables                | Reflection                 |
196| ------------------- | ------------------------- | -------------------------- |
197| Contains            | Field numbers and types   | All data in `.proto` file, |
198:                     : only                      : including names of         :
199:                     :                           : everything                 :
200| Used to parse       | binary format             | JSON / TextFormat          |
201| Wire representation | MiniDescriptor            | Descriptor                 |
202| Type names          | `upb_MiniTable`,          | `upb_MessageDef`,          |
203:                     : `upb_MiniTableField`, ... : `upb_FieldDef`, ...        :
204| Registry            | `upb_ExtensionRegistry`   | `upb_DefPool`              |
205:                     : (for extensions)          :                            :
206
207MiniTables are useful if you only need the binary wire format, because they are
208much lighter weight than full reflection.
209
210Reflection is useful if you need to parse JSON or TextFormat, or you need access
211to options that were specified in the `proto` file. Note that reflection also
212includes MiniTables, so if you have reflection, you also have MiniTables
213available.
214
215### MiniTables
216
217MiniTables are represented by a set of data structures with names like
218`upb_MiniTable` (representing a message), `upb_MiniTableField`,
219`upb_MiniTableFile`, etc. Whenever you see one of these types in a function
220signature, you know that this particular operation requires a MiniTable. For
221example:
222
223```
224// Parses the wire format data in the given buffer `[buf, size]` and writes it
225// to the message `msg`, which has the type `mt`.
226UPB_API upb_DecodeStatus upb_Decode(const char* buf, size_t size,
227                                    upb_Message* msg, const upb_MiniTable* mt,
228                                    const upb_ExtensionRegistry* extreg,
229                                    int options, upb_Arena* arena);
230```
231
232The subset of upb that requires only MiniTables can be thought of as "upb lite,"
233because both the code size and the runtime memory overhead will be less than
234"upb full" (the parts that use reflection).
235
236#### Loading
237
238There are three main ways of loading a MiniTable:
239
2401.  **From C generated code:** The upb code generator can emit `.upb.c` files that
241    contain the MiniTables as global constant variables. When the main program
242    links against these, the MiniTable will be placed into `.rodata` (or
243    `.data.rel.ro`) in the binary. The MiniTable can then be obtained from a
244    generated function. In Blaze/Bazel these files can be generated and linked
245    using the `upb_proto_library()` rule.
2462.  **From MiniDescriptors:** The user can build MiniDescriptors into MiniTables
247    at runtime. MiniDescriptors are a compact upb-specific wire format designed
248    specially for this purpose. The user can call `upb_MiniTable_Build()` at
249    runtime to convert MiniDescriptors to MiniTables.
2503.  **From reflection:** If you have already built reflection data structures
251    for your type, then you can obtain the `upb_MiniTable` corresponding to a
252    `upb_MessageDef` using `upb_MessageDef_MiniTable()`.
253
254For languages that are already using reflection, (3) is an obvious choice.
255
256For languages that are avoiding reflection, here is a general guideline for
257choosing between (1) and (2): if the language being wrapped participates in the
258standard binary linking model on a given platform (in particular, if it is
259generally linked using `ld`), then it is better to use (1), which is also known
260as "static loading".
261
262Static loading of MiniTables has the benefit of requiring no runtime
263initialization[^2], leading to faster startup. Static loading of MiniTables also
264facilitates cross-language sharing of proto messages, because sharing generally
265requires that both languages are using exactly the same MiniTables.
266
267The main downside of static loading is that it requires the user to generate one
268`.upb.c` file per `.proto` and link against the transitive closure of `.upb.c`
269files. Blaze and Bazel make this reasonably easy, but for other build systems it
270can be more of a challenge.
271
272[^2]: aside from possible pointer relocations performed by the ELF/Mach-O loader
273    if the library or binary is position-independent
274
275Loading from MiniDescriptors, as in option (2), has the advantage that it does
276not require per-message linking of C code. For many language toolchains,
277generating and linking some custom C code for every protobuf file or message
278type would be a burdensome requirement. MiniDescriptors are a convenient way of
279loading MiniTables without needing to cross the FFI boundary outside the core
280runtime.
281
282A common pattern when using dynamic loading is to embed strings containing
283MiniDescriptors directly into generated code. For example, the generated code in
284Dart for a message with only primitive fields currently looks something like:
285
286```dart
287  const desc = r'$(+),*-#$%&! /10';
288  _accessor = $pb.instance.registry.newMessageAccessor(desc);
289```
290
291The implementation of `newMessageAccessor()` is mainly just a wrapper around
292`upb_MiniTable_Build()`, which builds a MiniTable from a MiniDescriptor. In the
293code generator, the MiniDescriptor can be obtained from the
294`upb_MessageDef_MiniDescriptorEncode()` API; users should never need to encode a
295MiniDescriptor manually.
296
297#### Linking
298
299When building MiniTables dynamically, it is the user's responsibility to link
300each message to the to the appropriate sub-messages and or enums. Each message
301must have its message and closed enum fields linked using
302`upb_MiniTable_SetSubMessage()` and `upb_MiniTable_SetSubEnum()`, respectively.
303
304A higher-level function that links all fields at the same time is also
305available, as `upb_MiniTable_Link()`. This function pairs well with
306`upb_MiniTable_GetSubList()` which can be used in a code generator to get a list
307of all the messages and enums which must be passed to `upb_MiniTable_Link()`.
308
309A common pattern is to embed the `link()` calls directly into the generated
310code. For example, here is an example from Dart of building a MiniTable that
311contains sub-messages and enums:
312
313```dart
314  const desc = r'$3334';
315  _accessor = $pb.instance.registry.newMessageAccessor(desc);
316  _accessor!.link(
317      [
318        M2.$_accessor,
319        M3.$_accessor,
320        M4.$_accessor,
321      ],
322      [
323        E.$_accessor,
324      ],
325  );
326```
327
328In this case, `upb_MiniTable_GetSubList()` was used in the code generator to
329discover the 3 sub-message fields and 1 sub-enum field that require linking. At
330runtime, these lists of MiniTables are passed to the `link()` function, which
331will internally call `upb_MiniTable_Link()`.
332
333Note that in some cases, the application may choose to delay or even skip the
334registration of sub-message types, as part of a tree shaking strategy.
335
336When using static MiniTables, a manual link step is not necessary, as linking is
337performed automatically by `ld`.
338
339#### Enums
340
341MiniTables primarily carry data about messages, fields, and extensions. However
342for closed enums, we must also have a `upb_MiniTableEnum` structure that stores
343the set of all numbers that are defined in the enum. This is because closed
344enums have the unfortunate behavior of putting unknown enum values into the
345unknown field set.
346
347Over time, closed enums will hopefully be phased out via editions, and the
348relevance and overhead of `upb_MiniTableEnum` will shrink and eventually
349disappear.
350
351### Reflection
352
353Reflection uses types like `upb_MessageDef` and `upb_FieldDef` to represent the
354full contents of a `.proto` file at runtime. These types are upb's direct
355equivalents of `google::protobuf::Descriptor`, `google::protobuf::FieldDescriptor`, etc. [^1]
356
357[^1]: upb consistently uses `Def` where C++ would use `Descriptor` in type
358    names. This introduces divergence with C++; the rationale was to conserve
359    horizontal line length, as `Def` is less than 1/3 the length of
360    `Descriptor`. This is more relevant in C, where the type name is repeated
361    in every function, eg. `upb_FieldDef_Name()` vs.
362    `upb_FieldDescriptor_Name()`.
363
364Whenever you see one of these types in a function signature, you know that the
365given operation requires reflection. For example:
366
367```c
368// Parses JSON format into a message object, using reflection.
369UPB_API bool upb_JsonDecode(const char* buf, size_t size, upb_Message* msg,
370                            const upb_MessageDef* m, const upb_DefPool* symtab,
371                            int options, upb_Arena* arena, upb_Status* status);
372```
373
374The part of upb that requires reflection can be thought of as "upb full." These
375parts of the library cannot be used if a given application has only loaded
376MiniTables. There is no way to convert a MiniTable into reflection.
377
378The `upb_DefPool` type is the top-level container that builds and owns some set
379of defs. This type is a close analogue of `google::protobuf::DescriptorPool` in C++. The
380user must always ensure that the `upb_DefPool` outlives any def objects that it
381owns.
382
383#### Loading
384
385As with MiniTable loading, we have multiple options for how to load full
386reflection:
387
3881.  **From C generated code**: The upb code generator can create `foo.upbdefs.c`
389    files that embed the descriptors and exports generated C functions for
390    adding them to a user-provided `upb_DefPool`.
3912.  **From descriptors**: The user can make manual calls to
392    `upb_DefPool_AddFile()`, using descriptors obtained at runtime. Defs for
393    individual messages can then be obtained using
394    `upb_DefPool_FindMessageByName()`.
395
396Unlike MiniTables, loading from generated code requires runtime initialization,
397as reflection data structures like `upb_MessageDef` are not capable of being
398emitted directly into `.rodata` like `upb_MiniTable` is. Instead, the generated
399code embeds serialized descriptor protos into `.rodata` which are then built
400into heap objects at runtime.
401
402From this you might conclude that option (1) is nothing but a convenience
403wrapper around option (2), but that is not quite correct either. Option (1)
404*does* link against the static `.upb.c` structures for the MiniTables, whereas
405option (2) will build the MiniTables from scratch on the heap. So option (1)
406will use marginally less CPU and RAM when the descriptors are loaded into a
407`upb_DefPool`. More importantly, the resulting descriptors will be capable of
408reflecting over any messages built from the generated `.upb.c` MiniTables,
409whereas descriptors built using option (2) will have distinct MiniTables that
410cannot reflect over messages that use the generated MiniTables.
411
412A common pattern for dynamic languages like PHP, Ruby, or Python, is to use
413option (2) with descriptors that are embedded into the generated code. For
414example, the generated code in Python currently looks something like:
415
416```python
417from google.protobuf import descriptor_pool as _descriptor_pool
418from google.protobuf.internal import builder as _builder
419
420_desc = b'\n\x1aprotoc_explorer/main.proto\x12\x03pkg'
421
422DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(_desc)
423_globals = globals()
424_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
425_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.protoc_explorer.main_pb2', _globals)
426```
427
428The `AddSerializedFile()` API above is mainly just a thin wrapper around
429`upb_DefPool_AddFile()`.
430