# Edition Zero Features

**Authors:** [@mcy](https://github.com/mcy),
[@zhangskz](https://github.com/zhangskz),
[@mkruskal-google](https://github.com/mkruskal-google)

**Approved:** 2022-07-22

Feature flags, and their defaults, that we will introduce to define the
converged semantics of Edition Zero.

**NOTE:** This document is largely replaced by the topic,
[Feature Settings for Editions](https://protobuf.dev/editions/features) (to be
released soon).

## Overview

*Edition Zero Features* defines the "first edition" of the brave new world of
no-`syntax` Protobuf. This document defines the actual mechanics of the features
(in the narrow sense of editions) we need to implement in protoc, as well as the
chosen defaults.

This document will require careful review from various stakeholders, because it
is essentially defining a new Protobuf `syntax`, even if it isn't spelled that
way. In particular, we need to ensure that there is a way to rewrite existing
`proto2` and `proto3` files as `editions` files, and that "mixed syntax"
messages behave predictably, without any nasty surprises.

Note that it is an explicit goal that it be possible to take an arbitrary
proto2/proto3 file and convert it to editions without semantic changes, via
appropriate application of features.

## Existing Non-Conformance

We must keep in mind that the status quo is messy. Many languages have some
areas where they currently diverge from the correct proto2/proto3 semantics. For
edition zero, we must preserve these idiosyncratic behaviors, because that is
the only way for a proto2/proto3 -> editions large-scale change (LSC) to be a
no-op.

For example, in this document we define a feature `features.enum_type =
{CLOSED,OPEN}`. But currently Go does not implement closed enum semantics for
`syntax=proto2` as it should. This behavior is out of conformance, but we must
preserve it for edition zero.

In other words, defining features and their semantics is in scope for edition
zero, but fixing code generators to perfectly match those semantics is
explicitly out-of-scope.

## Glossary

Because we need to speak of two proto syntaxes, `proto2` and `proto3`, that have
disagreeing terminology in some places, we'll define the following terms to aid
discussion. When a term appears in `code font`, it refers to the Protobuf
language keyword.

*   A **presence discipline** is a way of handling the presence (or hasbit) of a
    field. Every field notionally has a hasbit: whether it has been explicitly
    set via the API or whether a record for it was present on deserialization.
    See
    [Application Note: Field Presence](https://protobuf.dev/programming-guides/field_presence)
    for more on this topic. The discipline specifies how this bit is surfaced to
    the user:
    *   **No presence** means that the API does not expose the hasbit. The
        default value for the field behaves somewhat like a special sentinel
        value, which is not serialized and not merged-from. The hasbit may still
        exist in the implementation (C++ accidentally leaks this via HasField,
        for example). Note that repeated fields sort-of behave like no presence
        fields.
    *   **Explicit presence** means that the API exposes the hasbit through a
        `has` method and a `Clear` method; default values are always serialized
        if the hasbit is set.
*   A **closed enum** is an enum where parsing requires validating that a parsed
    `int32` representing a field of this type matches one of the known set of
    valid values.
*   An **open enum** does not have this restriction, and is just an `int32`
    field with well-known values.

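
To make the two presence disciplines concrete, here is a minimal sketch in
today's proto3 syntax, where both already exist. The file, message, and field
names are hypothetical and are not taken from this document.

```proto
// presence_sketch.proto (hypothetical file, for illustration only)
syntax = "proto3";

message PresenceSketch {
  // No presence: there is no has-style accessor, and a value of 0 is
  // treated as "unset" and is not serialized.
  int32 implicit_count = 1;

  // Explicit presence: the generated API tracks whether the field was set,
  // so an explicitly set 0 is serialized and can be told apart from "unset".
  optional int32 explicit_count = 2;
}
```
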
For the purposes of this document, we will use the syntax described in *Features
as Custom Options*, since it is the prevailing consensus among those working on
editions, and allows us to have enum-typed features. The exact names for the
features are a matter of bikeshedding.

## Proposed Converged Semantics

There are two kinds of syntax behaviors we need to capture: those that are
turned on by a keyword, like `required`, and those that are implicit, like open
enums. The differences between proto2 and proto3 today are:

*   Required. Proto2 has `required` but not `defaulted`; Proto3 has `defaulted`
    but not `required`. Proto3 also does not allow custom defaults on
    `defaulted` fields, and on message-typed fields, `defaulted` is a synonym
    for `optional`.
*   Groups. Proto2 has groups; Proto3 does not.
*   Enums. In Proto2, enums are **closed**: enum values that are not in the
    known set are stored in the unknown field set. In Proto3, enums are
    **open**.
*   String validation. Proto2 is a bit wobbly on whether strings must be UTF-8
    when serialized; Proto3 enforces this (sometimes).
*   Extensions. Proto2 has extensions, while Proto3 does not (`Any` is the
    canonical workaround).

We propose defining the following features as part of edition zero:

### features.field_presence

This feature is enum-typed and controls the presence discipline of a singular
field:

*   `EXPLICIT` (default) - the field has *explicit presence* discipline. Any
    explicitly set value will be serialized onto the wire (even if it is the
    same as the default value).
*   `IMPLICIT` - the field has *no presence* discipline. The default value is
    not serialized onto the wire (even if it is explicitly set).
*   `LEGACY_REQUIRED` - the field is wire-required and API-optional. Setting
    this will require being in the `required` allowlist. Any explicitly set
    value will be serialized onto the wire (even if it is the same as the
    default value).

The syntax for singular fields is a much debated question. After discussing the
tradeoffs, we have chosen to *eliminate both the `optional` and `required`
keywords, making them parse errors*. Singular fields are spelled as in proto3
(no label), and will take on the presence discipline given by
`features.field_presence`. Migration will require deleting every instance of
`optional` in proto files in google3, of which there are 385,236.

It is important to observe that proto2 users are much likelier to care about
presence than proto3 users, since the design of proto3 discourages thinking
about presence as an interesting feature of protos, so arguably introducing
proto2-style presence will not register on most users' mental radars. This is
difficult to prove concretely.

`IMPLICIT` fields behave much like proto3 implicit fields: they cannot have
custom defaults, and the setting is ignored on submessage fields. Also, if it
is an enum-typed field, that enum must be open (i.e., it is either defined in a
`syntax = "proto3";` file or it specifies `option features.enum_type = OPEN;`
transitively).

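The constraints on `IMPLICIT` fields can be sketched as follows, using the
proposed (not final) editions syntax from this document. The file, message, and
field names are hypothetical, and the rejected declaration is commented out so
that the rest of the sketch stays well-formed.

```proto
// implicit_sketch.proto (hypothetical file, for illustration only)
edition = "tbd";
option features.field_presence = IMPLICIT;

message Child {
  int32 id = 1;
}

message ImplicitSketch {
  // Inherits IMPLICIT from the file: no presence, zero default.
  int32 count = 1;

  // Rejected under this proposal: IMPLICIT fields cannot carry a custom
  // default.
  // int32 retries = 2 [default = 3];

  // The file-level setting is ignored here: submessage fields keep
  // explicit presence.
  Child child = 3;
}
```
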
We also make some semantic changes:

*   ~~`IMPLICIT` fields may have a custom default value, unlike in `proto3`.
    Whether an `IMPLICIT` field containing its default value is serialized
    becomes an implementation choice (implementations are encouraged to try to
    avoid serializing too much, though).~~
*   `has_optional_keyword()` and `has_presence()` now check for `EXPLICIT`, and
    are effectively synonyms.
*   `proto3_optional` is rejected as a parse error (use the feature instead).

Migrating from proto2/3 involves deleting all `optional`/`required` labels and
adding `IMPLICIT` and `LEGACY_REQUIRED` annotations where necessary.

#### Alternatives

*   For syntax:
    *   Require `optional`. This may confuse proto3 users who are used to
        `optional` not being a default they reach for. Will result in
        significant (trivial, but noisy) churn in proto3 files. The keyword is
        effectively line noise, since it does not indicate anything other than
        "this is a singular field".
    *   Invent a new label, like `singular`. This results in more churn but
        avoids breaking people's priors.
    *   Allow `optional` and no label to coexist in a file, which take on their
        original meanings unless overridden by `features.field_presence`. The
        fact that a top-level `features.field_presence = IMPLICIT` breaks the
        proto3 expectation that `optional` means `EXPLICIT` may be a source of
        confusion.
*   `proto:allow_required`, which must be present for `required` to not be a
    syntax error.
*   Allow `required`/`optional` and introduce `defaulted` as a real keyword. We
    will not have another easy chance to introduce such syntax (which we have
    now, because `edition = ...` is a breaking change).
*   Reject custom defaults for `IMPLICIT` fields. This is technically not really
    needed for converged semantics, but trying to remove the Proto3-ness from
    `IMPLICIT` fields seems useful for consistency.

#### Future Work

In the future, we can introduce something like `features.always_serialize` or a
similar new enumerator (`ALWAYS_SERIALIZE`) to the `field_presence` enum, which
makes `EXPLICIT` fields unconditionally serialized, allowing `LEGACY_REQUIRED`
fields to become `EXPLICIT` in a future large-scale change. The details of such
a migration are out-of-scope for this document.

#### Migration Examples

Given the following files:

```
// foo.proto
syntax = "proto2";

message Foo {
  required int32 x = 1;
  optional int32 y = 2;
  repeated int32 z = 3;
}

// bar.proto
syntax = "proto3";

message Bar {
  int32 x = 1;
  optional int32 y = 2;
  repeated int32 z = 3;
}
```

post-editions, they will look like this:

```
// foo.proto
edition = "tbd";

message Foo {
  int32 x = 1 [features.field_presence = LEGACY_REQUIRED];
  int32 y = 2;
  repeated int32 z = 3;
}

// bar.proto
edition = "tbd";
option features.field_presence = IMPLICIT;

message Bar {
  int32 x = 1;
  int32 y = 2 [features.field_presence = EXPLICIT];
  repeated int32 z = 3;
}
```

### features.enum_type

Enum types come in two distinct flavors: *closed* and *open*.

*   *closed* enums will store enum values that are out of range in the unknown
    field set.
*   *open* enums will parse out of range values into their fields directly.

    **NOTE:** Closed enums can cause confusion for parallel arrays (two repeated
    fields that expect to have index i refer to the same logical concept in both
    fields) because an unknown enum value from a parallel array will be placed
    in the unknown field set and the arrays will cease being parallel. Similarly
    parsing and serializing can change the order of a repeated closed enum by
    moving unknown values to the end.

    **NOTE:** Some runtimes (C++ and Java, in particular) currently do not use
    the declaration site of enums to determine whether an enum field is treated
    as open; rather, they use the syntax of the message the field is defined
    in. To preserve this proto2 quirk until we can migrate users off of it,
    Java and C++ (and runtimes with the same quirk) will use the value of
    `features.enum_type` as set at the file level of messages (so, if a file
    sets `features.enum_type = CLOSED` at the file level, enum fields defined
    in it behave as if the enum was closed, regardless of declaration).
    `IMPLICIT` singular fields in Java and C++ ignore this and are always
    treated as open, because they used to only be possible to define in proto3
    files, which can't use proto2 enums.

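
The parallel-array hazard described in the first note can be sketched with a
hypothetical schema. The file, message, enum, and field names below are
illustrative only; the sketch uses proto2 syntax because that is where closed
enums exist today.

```proto
// inventory_sketch.proto (hypothetical file, for illustration only)
syntax = "proto2";

// Suppose an older reader was built before BLUE was added, so it only
// knows RED and GREEN.
enum Color {
  RED = 0;
  GREEN = 1;
  BLUE = 2;
}

message Inventory {
  // Parallel arrays: colors[i] is meant to describe counts[i].
  repeated Color colors = 1;
  repeated int32 counts = 2;
}

// A writer sends colors = [RED, BLUE, GREEN] and counts = [1, 2, 3].
// With closed enum semantics, the older reader moves BLUE to the unknown
// field set and sees colors = [RED, GREEN] next to counts = [1, 2, 3], so
// GREEN now lines up with the count that belonged to BLUE.
```
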
In proto2, `enum` values are closed and no requirements are placed upon the
first `enum` value. The first enum value will be used as the default value.

In proto3, `enum` values are open and the first `enum` value must be zero. The
first `enum` value is used as the default value, but that value is required to
be zero.

In edition zero, we will add a feature `features.enum_type = {CLOSED,OPEN}`. The
default will be `OPEN`. Upgraded proto2 files will explicitly set
`features.enum_type = CLOSED`. The requirement of having the first enum value be
zero will be dropped.

**NOTE:** Nominally this exposes a new state in the configuration space, OPEN
enums with a non-zero default. We decided that excluding this option simply
because it was previously inexpressible was a false economy.

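As a sketch of that newly expressible state under the rules proposed here, the
following hypothetical file declares an open enum whose first value, and
therefore its default, is non-zero. All names are illustrative, and
`edition = "tbd"` follows this document's placeholder.

```proto
// priority_sketch.proto (hypothetical file, for illustration only)
edition = "tbd";

// OPEN is the proposed default, so no feature annotation is needed.
enum Priority {
  HIGH = 1;  // First value, so it is the default; it need not be zero.
  LOW = 2;
}

message Task {
  // Unrecognized numeric values are kept in the field because the enum
  // is open.
  Priority priority = 1;
}
```
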
#### Alternatives

*   We could add a property for requiring a zero first value for an enum. This
    feels needlessly complicated.
*   We could drop the ability to have `CLOSED` enums, but that is a semantic
    change.

#### Migration Examples

Given the following files:

```
// foo.proto
syntax = "proto2";

enum Foo {
  A = 2; B = 4; C = 6;
}

// bar.proto
syntax = "proto3";

enum Bar {
  A = 0; B = 1; C = 5;
}
```

post-editions, they will look like this:

```
// foo.proto
edition = "tbd";
option features.enum_type = CLOSED;

enum Foo {
  A = 2; B = 4; C = 6;
}

// bar.proto
edition = "tbd";

enum Bar {
  A = 0; B = 1; C = 5;
}
```

If we wanted to merge them into one file, it would look like this:

```
// foo.proto
edition = "tbd";

// A real merged file would also need distinct value names, because enum
// value names share the enclosing scope rather than the enum's own scope.
enum Foo {
  option features.enum_type = CLOSED;
  A = 2; B = 4; C = 6;
}

enum Bar {
  A = 0; B = 1; C = 5;
}
```

### features.repeated_field_encoding

In proto3, the `repeated_field_encoding` attribute defaults to `PACKED`. In
proto2, the `repeated_field_encoding` attribute defaults to `EXPANDED`. Users
explicitly enabled packed fields 12.3k times, but only explicitly disabled them
200 times. Thus we can see a clear preference for
`repeated_field_encoding = PACKED` emerge. This data matches best practices. As
such, the default value for `features.repeated_field_encoding` will be `PACKED`.

The existing `[packed = …]` syntax will be made an alias for setting the feature
in edition zero. This alias will eventually be removed. Whether that removal
happens during the initial large-scale change to enable edition zero or as a
follow-on will be decided at the time.

In the long term, we would like to remove explicit usages of
`features.repeated_field_encoding = EXPANDED`, but we would prefer to separate
that large-scale change from the landing of edition zero. So we will explicitly
set `features.repeated_field_encoding` to `EXPANDED` at the file level when we
migrate proto2 files to edition zero.

#### Alternatives

*   Force everyone to use packed fields. This is a semantic change, which we're
    trying to avoid in edition zero.
*   Don’t add `features.repeated_field_encoding` and instead specify `[packed =
    false]` when converting from proto2. This will be incredibly noisy,
    syntax-wise and diff-wise.

#### Migration Examples

Given the following files:

```
// foo.proto
syntax = "proto2";

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = true];
  repeated int32 z = 3;
}

// bar.proto
syntax = "proto3";

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = false];
  repeated int32 z = 3;
}
```

post-editions, they will look like this:

```
// foo.proto
edition = "tbd";
option features.repeated_field_encoding = EXPANDED;

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = true];
  repeated int32 z = 3;
}

// bar.proto
edition = "tbd";

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = false];
  repeated int32 z = 3;
}
```

Note that post migration, we have not changed `packed` to
`features.repeated_field_encoding = PACKED`, although we could choose to do so
if the diff cost is not monumental. We prefer to defer to an LSC after editions
are shipped, if possible.

### features.string_field_validation

**WARNING:** UTF8 validation is actually messier than originally believed. This
feature is being reconsidered in *Editions Zero Feature: utf8_validation*.

This feature is a tristate:

*   `MANDATORY` - this means that a runtime MUST verify UTF-8.
*   `HINT` - this means that a runtime may refuse to parse invalid UTF-8, but it
    can also simply skip the check for performance in some build modes.
*   `NONE` - this field behaves like a `bytes` field on the wire, but parsers
    may mangle the string in an unspecified way (for example, Java may insert
    spaces as replacement characters).

The default will be `MANDATORY`.

Long term, we would like to remove this feature and make all `string` fields
`MANDATORY`.

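A minimal sketch of how this feature might be applied, using the proposed
syntax from this document. The file, message, and field names are hypothetical,
and the choice of which fields to relax is purely illustrative.

```proto
// log_sketch.proto (hypothetical file, for illustration only)
edition = "tbd";

// File-level setting: parsers may skip the UTF-8 check for string fields
// declared in this file.
option features.string_field_validation = HINT;

message LogEntry {
  // Inherits HINT from the file-level option above.
  string raw_line = 1;

  // Field-level override: this field must always be verified as UTF-8.
  string user_name = 2 [features.string_field_validation = MANDATORY];
}
```
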
#### Alternatives

*   Drop the UTF-8 requirements completely. This seems like it will create more
    problems than it will solve (e.g., random things relying on validation need
    to be fixed) and it will be a lot of work. This is also counter to the
    vision of string being a UTF-8 type, and bytes being its unchecked sibling.
*   Make opt-in verification a hard requirement instead of a hint, so that users
    have a nice performance needle they can play with.

#### Future Work

In the infinite future, we would like to remove this feature and force all
`string` fields to be UTF-8 validated. To do this, we need to recognize that
what many callers want from their `string` fields is a `bytes` field with a
`string`-like API. To ease the transition, we would add per-codegen backend
features, like `java.bytes_as_string`, that give a `bytes` field a generated API
resembling that of a `string` field (with caveats about replacement characters
forced by the host language's string type).

The migration would take `HINT` or `NONE` `string` fields and convert them into
`bytes` fields with the appropriate API modifiers, depending on which languages
use that proto; C++-only protos, for example, are a no-op.

There is an argument to be made for "I want a string type, and I explicitly want
replacement U+FFFD characters if I get something that isn't UTF-8." It is
unclear if this is something users want and we would need to investigate it
before making a decision.

### features.json_format

This feature has two states in edition zero:

*   `ALLOW` - this means that a runtime must allow JSON parsing and
    serialization. Checks will be applied at the proto level to make sure that
    there is a well-defined mapping to JSON.
*   `LEGACY_BEST_EFFORT` - this means that a runtime will do the best it can to
    parse and serialize JSON. Certain protos will be allowed that can result in
    undefined behavior at runtime (e.g. many:1 or 1:many mappings).

The default will be `ALLOW`, which maps to the current proto3 behavior.
`LEGACY_BEST_EFFORT` will be used for proto2 files that require it (e.g. they’ve
set `deprecated_legacy_json_field_conflicts`).

#### Alternatives

*   Keep the proto2 behavior - this will regress proto3 files by removing
    validation for JSON mappings, and lead to *more* undefined runtime behavior.
*   Only use `ALLOW` - there are ~30 cases internally where protos have invalid
    JSON mappings and rely on unspecified (but luckily well defined) runtime
    behavior.

#### Future Work

Long term, we would like to either remove this feature entirely or add a
`DISALLOW` option instead of `LEGACY_BEST_EFFORT`. This will more strictly
enforce that protos without a valid JSON mapping *can’t* be serialized or parsed
to JSON. `DISALLOW` will be enforced at the proto-language level, where no
message marked `ALLOW` can contain any message/enum marked `DISALLOW` (e.g.
through extensions or fields).

#### Migration Examples

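No example is given for this feature, so the following is only a sketch under
stated assumptions: that `features.json_format` can be set at the message
level, and that a proto2 message which opts into
`deprecated_legacy_json_field_conflicts` is the kind of case that needs
`LEGACY_BEST_EFFORT`. The message and field names are hypothetical.

Given the following file:

```proto
// foo.proto
syntax = "proto2";

message Foo {
  // Opts this message out of the JSON name conflict check.
  option deprecated_legacy_json_field_conflicts = true;

  // Both fields map to the JSON name "fooBar" (a many:1 mapping).
  optional int32 foo_bar = 1;
  optional int32 fooBar = 2;
}
```

post-editions, it might look like this:

```proto
// foo.proto
edition = "tbd";

message Foo {
  // Assumed placement: keeps the legacy best-effort JSON handling.
  option features.json_format = LEGACY_BEST_EFFORT;

  int32 foo_bar = 1;
  int32 fooBar = 2;
}
```
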
### Extensions are Always Allowed

Extensions may be used on all messages. This lifts a restriction from proto3.

Extensions do not play nicely with `TypeResolver`. This is actually fixable, but
probably only worth it if someone complains.

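As a sketch of what this allows, the hypothetical file below declares an
extension range and an extension using the proposed editions syntax. The file,
message, and field names, and the chosen field numbers, are illustrative only.

```proto
// ext_sketch.proto (hypothetical file, for illustration only)
edition = "tbd";

message Foo {
  int32 x = 1;

  // Reserve a range of field numbers for extensions.
  extensions 100 to 199;
}

// Extend Foo from the same or another file. No `optional` label is
// written, because this proposal removes that keyword.
extend Foo {
  int32 bar = 100;
}
```
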
#### Alternatives

*   Add `features.allow_extensions`, default true. This feels unnecessary since
    uttering `extend` and `extensions` is required to use extensions in the
    first place.

### features.message_encoding

This feature defaults to `LENGTH_PREFIXED`. The `group` syntax does not exist
under editions. Instead, message-typed fields that have
`features.message_encoding = DELIMITED` set will be encoded as groups (wire type
3/4) rather than byte blobs (wire type 2). This reflects the existing API
(groups are funny message fields) and simplifies the parser.

A `proto2` group field will be converted into a nested message type of the same
name, plus a singular submessage field that is named after the message type in
snake_case and has `features.message_encoding = DELIMITED` set.

This could be used in the future to switch new message fields to use group
encoding, which was suggested previously as an efficiency direction.

#### Alternatives

*   Allow groups in `editions` with no changes. `group` syntax is deprecated, so
    we may as well take the opportunity to knock it out.
*   Add a sidecar allowlist like we do for `required`. This is mostly
    orthogonal.

#### Migration Examples

Given the following file:

```
// foo.proto
syntax = "proto2";

message Foo {
  optional group Bar = 1 {
    optional int32 x = 1;
    repeated int32 y = 2;
  }
}
```

post-editions, it will look like this:

```
// foo.proto
edition = "tbd";

message Foo {
  message Bar {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar bar = 1 [features.message_encoding = DELIMITED];
}
```

## Proposed Features Message

Putting together all of the above, we propose the following `Features` message,
including retention and target rules associated with fields.

```
message Features {
  enum FieldPresence {
    EXPLICIT = 0;
    IMPLICIT = 1;
    LEGACY_REQUIRED = 2;
  }
  optional FieldPresence field_presence = 1 [
      retention = RUNTIME,
      target = FILE,
      target = FIELD
  ];

  enum EnumType {
    OPEN = 0;
    CLOSED = 1;
  }
  optional EnumType enum_type = 2 [
      retention = RUNTIME,
      target = FILE,
      target = ENUM
  ];

  enum RepeatedFieldEncoding {
    PACKED = 0;
    EXPANDED = 1;
  }
  optional RepeatedFieldEncoding repeated_field_encoding = 3 [
      retention = RUNTIME,
      target = FILE,
      target = FIELD
  ];

  enum StringFieldValidation {
    MANDATORY = 0;
    HINT = 1;
    NONE = 2;
  }
  optional StringFieldValidation string_field_validation = 4 [
      retention = RUNTIME,
      target = FILE,
      target = FIELD
  ];

  enum MessageEncoding {
    LENGTH_PREFIXED = 0;
    DELIMITED = 1;
  }
  optional MessageEncoding message_encoding = 5 [
      retention = RUNTIME,
      target = FILE,
      target = FIELD
  ];

  extensions 1000;  // for features_cpp.proto
  extensions 1001;  // for features_java.proto
}
```
