# Edition Zero Features

**Authors:** [@mcy](https://github.com/mcy),
[@zhangskz](https://github.com/zhangskz),
[@mkruskal-google](https://github.com/mkruskal-google)

**Approved:** 2022-07-22

Feature flags, and their defaults, that we will introduce to define the
converged semantics of Edition Zero.

**NOTE:** This document is largely replaced by the topic,
[Feature Settings for Editions](https://protobuf.dev/editions/features) (to be
released soon).

## Overview

*Edition Zero Features* defines the "first edition" of the brave new world of
no-`syntax` Protobuf. This document defines the actual mechanics of the features
(in the narrow sense of editions) we need to implement in protoc, as well as the
chosen defaults.

This document will require careful review from various stakeholders, because it
is essentially defining a new Protobuf `syntax`, even if it isn't spelled that
way. In particular, we need to ensure that there is a way to rewrite existing
`proto2` and `proto3` files as `editions` files, and to define the behavior of
"mixed syntax" messages, without any nasty surprises.

Note that it is an explicit goal that it be possible to take an arbitrary
proto2/proto3 file and convert it to editions without semantic changes, via
appropriate application of features.

## Existing Non-Conformance

We must keep in mind that the status quo is messy. Many languages have some
areas where they currently diverge from the correct proto2/proto3 semantics. For
edition zero, we must preserve these idiosyncratic behaviors, because that is
the only way for a proto2/proto3 -> editions LSC to be a no-op.

For example, in this document we define a feature `features.enum_type =
{CLOSED,OPEN}`. But currently Go does not implement closed enum semantics for
`syntax=proto2` as it should. This behavior is out of conformance, but we must
preserve it for edition zero.

In other words, defining features and their semantics is in scope for edition
zero, but fixing code generators to perfectly match those semantics is
explicitly out-of-scope.

## Glossary

Because we need to speak of two proto syntaxes, `proto2` and `proto3`, that have
disagreeing terminology in some places, we'll define the following terms to aid
discussion. When a term appears in `code font`, it refers to the Protobuf
language keyword.

*   A **presence discipline** is a handling for the presence (or hasbit) of a
    field. Every field notionally has a hasbit: whether it has been explicitly
    set via the API or whether a record for it was present on deserialization.
    See
    [Application Note: Field Presence](https://protobuf.dev/programming-guides/field_presence)
    for more on this topic. The discipline specifies how this bit is surfaced to
    the user:
    *   **No presence** means that the API does not expose the hasbit. The
        default value for the field behaves somewhat like a special sentinel
        value, which is not serialized and not merged-from. The hasbit may still
        exist in the implementation (C++ accidentally leaks this via HasField,
        for example). Note that repeated fields sort-of behave like no presence
        fields.
    *   **Explicit presence** means that the API exposes the hasbit through a
        `has` method and a `Clear` method; default values are always serialized
        if the hasbit is set.
*   A **closed enum** is an enum where parsing requires validating that a parsed
    `int32` representing a field of this type matches one of the known set of
    valid values.
*   An **open enum** does not have this restriction, and is just an `int32`
    field with well-known values.

For the purposes of this document, we will use the syntax described in *Features
as Custom Options*, since it is the prevailing consensus among those working on
editions, and allows us to have enum-typed features.
The exact names for the features are a matter of bikeshedding.

## Proposed Converged Semantics

There are two kinds of syntax behaviors we need to capture: those that are
turned on by a keyword, like `required`, and those that are implicit, like open
enums. The differences between proto2 and proto3 today are:

*   Required. Proto2 has `required` but not `defaulted`; Proto3 has `defaulted`
    but not `required`. Proto3 also does not allow custom defaults on
    `defaulted` fields, and on message-typed fields, `defaulted` is a synonym
    for `optional`.
*   Groups. Proto2 has groups, proto3 does not.
*   Enums. In Proto2, enums are **closed**: parsed enum values that are not in
    the known set are stored in the unknown field set. In Proto3, enums are
    **open**.
*   String validation. Proto2 is a bit wobbly on whether strings must be UTF-8
    when serialized; Proto3 enforces this (sometimes).
*   Extensions. Proto2 has extensions, while Proto3 does not (`Any` is the
    canonical workaround).

We propose defining the following features as part of edition zero:

### features.field_presence

This feature is enum-typed and controls the presence discipline of a singular
field:

*   `EXPLICIT` (default) - the field has *explicit presence* discipline. Any
    explicitly set value will be serialized onto the wire (even if it is the
    same as the default value).
*   `IMPLICIT` - the field has *no presence* discipline. The default value is
    not serialized onto the wire (even if it is explicitly set).
*   `LEGACY_REQUIRED` - the field is wire-required and API-optional. Setting
    this will require being in the `required` allowlist. Any explicitly set
    value will be serialized onto the wire (even if it is the same as the
    default value).

The syntax for singular fields is a much debated question.
After discussing the tradeoffs, we have chosen to *eliminate both the `optional`
and `required` keywords, making them parse errors*. Singular fields are spelled
as in proto3 (no label), and will take on the presence discipline given by
`features.field_presence`. Migration will require deleting every instance of
`optional` in proto files in google3, of which there are 385,236.

It is important to observe that proto2 users are much likelier to care about
presence than proto3 users, since the design of proto3 discourages thinking
about presence as an interesting feature of protos, so arguably introducing
proto2-style presence will not register on most users' mental radars. This is
difficult to prove concretely.

`IMPLICIT` fields behave much like proto3 implicit fields: they cannot have
custom defaults, and the setting is ignored on submessage fields. Also, if an
`IMPLICIT` field is enum-typed, that enum must be open (i.e., it is either
defined in a `syntax = proto3;` file or it specifies
`option features.enum_type = OPEN;` transitively).

We also make some semantic changes:

*   ~~`IMPLICIT` fields may have a custom default value, unlike in `proto3`.
    Whether an `IMPLICIT` field containing its default value is serialized
    becomes an implementation choice (implementations are encouraged to try to
    avoid serializing too much, though).~~
*   `has_optional_keyword()` and `has_presence()` now check for `EXPLICIT`, and
    are effectively synonyms.
*   `proto3_optional` is rejected as a parse error (use the feature instead).

Migrating from proto2/3 involves deleting all `optional`/`required` labels and
adding `IMPLICIT` and `LEGACY_REQUIRED` annotations where necessary.

#### Alternatives

*   For syntax:
    *   Require `optional`. This may confuse proto3 users who are used to
        `optional` not being a default they reach for. Will result in
        significant (trivial, but noisy) churn in proto3 files.
        The keyword is effectively line noise, since it does not indicate
        anything other than "this is a singular field".
    *   Invent a new label, like `singular`. This results in more churn but
        avoids breaking people's priors.
    *   Allow `optional` and no label to coexist in a file, which take on their
        original meanings unless overridden by `features.field_presence`. The
        fact that a top-level `features.field_presence = IMPLICIT` breaks the
        proto3 expectation that `optional` means `EXPLICIT` may be a source of
        confusion.
*   `proto:allow_required`, which must be present for `required` to not be a
    syntax error.
*   Allow `required`/`optional` and introduce `defaulted` as a real keyword. We
    will not have another easy chance to introduce such syntax (we have one
    now, because `edition = ...` is a breaking change).
*   Reject custom defaults for `IMPLICIT` fields. This is technically not really
    needed for converged semantics, but trying to remove the Proto3-ness from
    `IMPLICIT` fields seems useful for consistency.

#### Future Work

In the future, we can introduce something like `features.always_serialize` or a
similar new enumerator (`ALWAYS_SERIALIZE`) to the `when_missing` enum, which
makes `EXPLICIT` fields unconditionally serialized, allowing `LEGACY_REQUIRED`
fields to become `EXPLICIT` in a future large-scale change. The details of such
a migration are out-of-scope for this document.
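
The serialization rules attached to the three `features.field_presence` values
can be illustrated with a small Python model. This is only a sketch of the
wording in this document, not code from any protobuf runtime: the function
name `should_serialize` and its parameters are hypothetical, and raising an
error for an unset `LEGACY_REQUIRED` field is an assumption about what
"wire-required" means at serialization time.

```python
from enum import Enum


class FieldPresence(Enum):
    EXPLICIT = 0
    IMPLICIT = 1
    LEGACY_REQUIRED = 2


def should_serialize(presence, value, default, hasbit):
    """Decide whether a singular scalar field is written to the wire.

    Hypothetical helper: `hasbit` models "explicitly set via the API".
    """
    if presence is FieldPresence.IMPLICIT:
        # No presence: the default acts as a sentinel and is never written,
        # even if the caller set it explicitly.
        return value != default
    if presence is FieldPresence.LEGACY_REQUIRED and not hasbit:
        # Assumption: a wire-required field that was never set is an error.
        raise ValueError("required field is not set")
    # EXPLICIT, or a LEGACY_REQUIRED field that is set: the hasbit decides,
    # so an explicitly set default value is still written.
    return hasbit
```

Under this model, explicitly setting a field to its default value produces a
wire record for `EXPLICIT` and `LEGACY_REQUIRED` fields but not for `IMPLICIT`
ones.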

#### Migration Examples

Given the following files:

```
// foo.proto
syntax = "proto2";

message Foo {
  required int32 x = 1;
  optional int32 y = 2;
  repeated int32 z = 3;
}

// bar.proto
syntax = "proto3";

message Bar {
  int32 x = 1;
  optional int32 y = 2;
  repeated int32 z = 3;
}
```

post-editions, they will look like this:

```
// foo.proto
edition = "tbd";

message Foo {
  int32 x = 1 [features.field_presence = LEGACY_REQUIRED];
  int32 y = 2;
  repeated int32 z = 3;
}

// bar.proto
edition = "tbd";
option features.field_presence = IMPLICIT;

message Bar {
  int32 x = 1;
  int32 y = 2 [features.field_presence = EXPLICIT];
  repeated int32 z = 3;
}
```

### features.enum_type

Enum types come in two distinct flavors: *closed* and *open*.

*   *closed* enums will store enum values that are out of range in the unknown
    field set.
*   *open* enums will parse out of range values into their fields directly.

    **NOTE:** Closed enums can cause confusion for parallel arrays (two repeated
    fields that expect to have index i refer to the same logical concept in both
    fields) because an unknown enum value from a parallel array will be placed
    in the unknown field set and the arrays will cease being parallel. Similarly
    parsing and serializing can change the order of a repeated closed enum by
    moving unknown values to the end.

    **NOTE:** Some runtimes (C++ and Java, in particular) currently do not use
    the declaration site of enums to determine whether an enum field is treated
    as open; rather, they use the syntax of the message the field is defined in.
    To preserve this proto2 quirk until we can migrate users off of it,
    Java and C++ (and runtimes with the same quirk) will use the value of
    `features.enum_type` as set at the file level of messages (so, if a file
    sets `features.enum_type = CLOSED` at the file level, enum fields defined in
    it behave as if the enum was closed, regardless of declaration). `IMPLICIT`
    singular fields in Java and C++ ignore this and are always treated as open,
    because they used to only be possible to define in proto3 files, which can't
    use proto2 enums.

In proto2, `enum` values are closed and no requirements are placed upon the
first `enum` value. The first enum value will be used as the default value.

In proto3, `enum` values are open and the first `enum` value must be zero. The
first `enum` value is used as the default value, but that value is required to
be zero.

In edition zero, we will add a feature `features.enum_type = {CLOSED,OPEN}`. The
default will be `OPEN`. Upgraded proto2 files will explicitly set
`features.enum_type = CLOSED`. The requirement of having the first enum value be
zero will be dropped.

**NOTE:** Nominally this exposes a new state in the configuration space, `OPEN`
enums with a non-zero default. We decided that excluding this option simply
because it was previously inexpressible was a false economy.

#### Alternatives

*   We could add a property for requiring a zero first value for an enum. This
    feels needlessly complicated.
*   We could drop the ability to have `CLOSED` enums, but that is a semantic
    change.
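
The difference between `CLOSED` and `OPEN` parsing, and the parallel-array
hazard noted earlier in this section, can be modeled in a few lines of Python.
This is a sketch under simplifying assumptions: the helper names
`parse_repeated_enum` and `reserialize` are hypothetical, and the unknown field
set is reduced to a plain list of integers.

```python
def parse_repeated_enum(wire_values, known_values, closed):
    """Split parsed int32 values into (field_values, unknown_values).

    For a closed enum, values outside `known_values` go to the unknown set;
    for an open enum, every value is stored in the field directly.
    """
    field_values, unknown_values = [], []
    for value in wire_values:
        if closed and value not in known_values:
            unknown_values.append(value)
        else:
            field_values.append(value)
    return field_values, unknown_values


def reserialize(field_values, unknown_values):
    """Model of re-serialization: unknown values are written after the field."""
    return field_values + unknown_values


FOO_VALUES = {2, 4, 6}  # the values of enum Foo in the migration examples

closed_field, closed_unknown = parse_repeated_enum([2, 7, 6], FOO_VALUES, closed=True)
open_field, open_unknown = parse_repeated_enum([2, 7, 6], FOO_VALUES, closed=False)
```

With the closed enum, the field ends up one element shorter than any parallel
repeated field that was parsed alongside it, and a parse/serialize round trip
moves the unknown value to the end.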

#### Migration Examples

Given the following files:

```
// foo.proto
syntax = "proto2";

enum Foo {
  A = 2; B = 4; C = 6;
}

// bar.proto
syntax = "proto3";

enum Bar {
  A = 0; B = 1; C = 5;
}
```

post-editions, they will look like this:

```
// foo.proto
edition = "tbd";
option features.enum_type = CLOSED;

enum Foo {
  A = 2; B = 4; C = 6;
}

// bar.proto
edition = "tbd";

enum Bar {
  A = 0; B = 1; C = 5;
}
```

If we wanted to merge them into one file, it would look like this:

```
// foo.proto
edition = "tbd";

enum Foo {
  option features.enum_type = CLOSED;
  A = 2; B = 4; C = 6;
}


enum Bar {
  A = 0; B = 1; C = 5;
}
```

### features.repeated_field_encoding

In proto3, the `repeated_field_encoding` attribute defaults to `PACKED`. In
proto2, the `repeated_field_encoding` attribute defaults to `EXPANDED`. Users
have explicitly enabled packed fields 12.3k times, but have explicitly disabled
them only 200 times. Thus we can see a clear preference for
`repeated_field_encoding = PACKED` emerge. This data matches best practices. As
such, the default value for `features.repeated_field_encoding` will be `PACKED`.

The existing `[packed = …]` syntax will be made an alias for setting the feature
in edition zero. This alias will eventually be removed. Whether that removal
happens during the initial large-scale change to enable edition zero or as a
follow-on will be decided at the time.

In the long term, we would like to remove explicit usages of
`features.repeated_field_encoding = EXPANDED`, but we would prefer to separate
that large-scale change from the landing of edition zero. So we will explicitly
set `features.repeated_field_encoding` to `EXPANDED` at the file level when we
migrate proto2 files to edition zero.
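
To make the two encodings concrete, the following Python sketch builds the wire
bytes for a repeated `int32` field both ways. It follows the standard protobuf
wire format (varints, wire type 0 for `VARINT` records, wire type 2 for
length-delimited records); the function names are hypothetical, and negative
values are left out to keep the example short.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_repeated_int32(field_number, values, packed):
    """Encode a repeated int32 field as PACKED or EXPANDED."""
    if not values:
        return b""
    elements = [encode_varint(v) for v in values]
    if packed:
        # PACKED: one record with wire type 2, a byte length, then all varints.
        body = b"".join(elements)
        tag = encode_varint((field_number << 3) | 2)
        return tag + encode_varint(len(body)) + body
    # EXPANDED: one record per element, each with its own wire type 0 tag.
    tag = encode_varint((field_number << 3) | 0)
    return b"".join(tag + element for element in elements)


packed_bytes = encode_repeated_int32(1, [1, 2, 300], packed=True)
expanded_bytes = encode_repeated_int32(1, [1, 2, 300], packed=False)
```

For field number 1 and the values 1, 2, and 300, the packed form is six bytes
(`0a 04 01 02 ac 02`) and the expanded form is seven (`08 01 08 02 08 ac 02`),
because the tag is repeated for every element.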

#### Alternatives

*   Force everyone to use packed fields. This is a semantic change, which we're
    trying to avoid in edition zero.
*   Don’t add `features.repeated_field_encoding` and instead specify `[packed =
    false]` when converting from proto2. This will be incredibly noisy,
    syntax-wise and diff-wise.

#### Migration Examples

Given the following files:

```
// foo.proto
syntax = "proto2";

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = true];
  repeated int32 z = 3;
}

// bar.proto
syntax = "proto3";

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = false];
  repeated int32 z = 3;
}
```

post-editions, they will look like this:

```
// foo.proto
edition = "tbd";
option features.repeated_field_encoding = EXPANDED;

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = true];
  repeated int32 z = 3;
}


// bar.proto
edition = "tbd";

message Foo {
  repeated int32 x = 1;
  repeated int32 y = 2 [packed = false];
  repeated int32 z = 3;
}
```

Note that post migration, we have not changed `packed` to
`features.repeated_field_encoding = PACKED`, although we could choose to do so
if the diff cost is not monumental. We prefer to defer to an LSC after editions
are shipped, if possible.

### features.string_field_validation

**WARNING:** UTF-8 validation is actually messier than originally believed. This
feature is being reconsidered in _Editions Zero Feature: utf8_validation_.

This feature is a tristate:

*   `MANDATORY` - this means that a runtime MUST verify UTF-8.
*   `HINT` - this means that a runtime may refuse to parse invalid UTF-8, but it
    can also simply skip the check for performance in some build modes.
*   `NONE` - this field behaves like a `bytes` field on the wire, but parsers
    may mangle the string in an unspecified way (for example, Java may insert
    spaces as replacement characters).

The default will be `MANDATORY`.

Long term, we would like to remove this feature and make all `string` fields
`MANDATORY`.

#### Alternatives

*   Drop the UTF-8 requirements completely. This seems like it will create more
    problems than it will solve (e.g., random things relying on validation need
    to be fixed) and it will be a lot of work. This is also counter to the
    vision of string being a UTF-8 type, and bytes being its unchecked sibling.
*   Make opt-in verification a hard requirement instead of a hint, so that users
    have a nice performance needle they can play with.

#### Future Work

In the infinite future, we would like to remove this feature and force all
`string` fields to be UTF-8 validated. To do this, we need to recognize that
what many callers want from their `string` fields is a `bytes` field with a
`string`-like API. To ease the transition, we would add per-codegen backend
features, like `java.bytes_as_string`, that give a `bytes` field a generated API
resembling that of a `string` field (with caveats about replacement characters
forced by the host language's string type).

The migration would take `HINT` or `NONE` `string` fields and convert them into
`bytes` fields with the appropriate API modifiers, depending on which languages
use that proto; C++-only protos, for example, are a no-op.

There is an argument to be made for "I want a string type, and I explicitly want
replacement U+FFFD characters if I get something that isn't UTF-8." It is
unclear if this is something users want and we would need to investigate it
before making a decision.
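
As a rough illustration of the tristate, the Python sketch below parses the raw
bytes of a `string` field under each setting. The function
`parse_string_field` and its `check_hints` flag (standing in for a "build
mode") are hypothetical. Substituting U+FFFD when validation is skipped is an
assumption made for the example; this document leaves the exact mangling
unspecified.

```python
def parse_string_field(raw, validation, check_hints=True):
    """Decode the bytes of a string field under a validation setting.

    validation is one of "MANDATORY", "HINT", or "NONE".
    """
    if validation not in ("MANDATORY", "HINT", "NONE"):
        raise ValueError(f"unknown validation mode: {validation}")
    if validation == "MANDATORY" or (validation == "HINT" and check_hints):
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        return raw.decode("utf-8")
    # NONE, or HINT with the check skipped. Assumption: invalid sequences
    # become U+FFFD, which is what Python's "replace" handler produces.
    return raw.decode("utf-8", errors="replace")
```

Valid input decodes identically in all three modes; only invalid input
distinguishes them, which is why `HINT` can skip the check without affecting
well-formed data.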

### features.json_format

This feature has two states in edition zero:

*   `ALLOW` - this means that a runtime must allow JSON parsing and
    serialization. Checks will be applied at the proto level to make sure that
    there is a well-defined mapping to JSON.
*   `LEGACY_BEST_EFFORT` - this means that a runtime will do the best it can to
    parse and serialize JSON. Certain protos will be allowed that can result in
    undefined behavior at runtime (e.g. many:1 or 1:many mappings).

The default will be `ALLOW`, which maps to the current proto3 behavior.
`LEGACY_BEST_EFFORT` will be used for proto2 files that require it (e.g. they’ve
set `deprecated_legacy_json_field_conflicts`).

#### Alternatives

*   Keep the proto2 behavior - this will regress proto3 files by removing
    validation for JSON mappings, and lead to *more* undefined runtime behavior.
*   Only use `ALLOW` - there are ~30 cases internally where protos have invalid
    JSON mappings and rely on unspecified (but luckily well defined) runtime
    behavior.

#### Future Work

Long term, we would like to either remove this feature entirely or add a
`DISALLOW` option instead of `LEGACY_BEST_EFFORT`. This will more strictly
enforce that protos without a valid JSON mapping *can’t* be serialized or parsed
to JSON. `DISALLOW` will be enforced at the proto-language level, where no
message marked `ALLOW` can contain any message/enum marked `DISALLOW` (e.g.
through extensions or fields).

#### Migration Examples

### Extensions are Always Allowed

Extensions may be used on all messages. This lifts a restriction from proto3.

Extensions do not play nicely with `TypeResolver`. This is actually fixable, but
probably only worth it if someone complains.

#### Alternatives

*   Add `features.allow_extensions`, default true.
    This feels unnecessary since uttering `extend` and `extensions` is required
    to use extensions in the first place.

### features.message_encoding

This feature defaults to `LENGTH_PREFIXED`. The `group` syntax does not exist
under editions. Instead, message-typed fields that have
`features.message_encoding = DELIMITED` set will be encoded as groups (wire type
3/4) rather than byte blobs (wire type 2). This reflects the existing API
(groups are funny message fields) and simplifies the parser.

A `proto2` group field will be converted into a nested message type of the same
name, and a singular submessage field that is `features.message_encoding =
DELIMITED` with the message type's name in snake_case.

This could be used in the future to switch new message fields to use group
encoding, which was suggested previously as an efficiency direction.

#### Alternatives

*   Allow groups in `editions` with no changes. `group` syntax is deprecated, so
    we may as well take the opportunity to knock it out.
*   Add a sidecar allowlist like we do for `required`. This is mostly
    orthogonal.

#### Migration Examples

Given the following file:

```
// foo.proto
syntax = "proto2";

message Foo {
  group Bar = 1 {
    optional int32 x = 1;
    repeated int32 y = 2;
  }
}
```

post-editions, it will look like this:

```
// foo.proto
edition = "tbd";

message Foo {
  message Bar {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar bar = 1 [features.message_encoding = DELIMITED];
}
```

## Proposed Features Message

Putting together all of the above, we propose the following `Features` message,
including retention and target rules associated with fields.

```
message Features {
  enum FieldPresence {
    EXPLICIT = 0;
    IMPLICIT = 1;
    LEGACY_REQUIRED = 2;
  }
  optional FieldPresence field_presence = 1 [
    retention = RUNTIME,
    target = FILE,
    target = FIELD
  ];

  enum EnumType {
    OPEN = 0;
    CLOSED = 1;
  }
  optional EnumType enum_type = 2 [
    retention = RUNTIME,
    target = FILE,
    target = ENUM
  ];

  enum RepeatedFieldEncoding {
    PACKED = 0;
    EXPANDED = 1;
  }
  optional RepeatedFieldEncoding repeated_field_encoding = 3 [
    retention = RUNTIME,
    target = FILE,
    target = FIELD
  ];

  enum StringFieldValidation {
    MANDATORY = 0;
    HINT = 1;
    NONE = 2;
  }
  optional StringFieldValidation string_field_validation = 4 [
    retention = RUNTIME,
    target = FILE,
    target = FIELD
  ];

  enum MessageEncoding {
    LENGTH_PREFIXED = 0;
    DELIMITED = 1;
  }
  optional MessageEncoding message_encoding = 5 [
    retention = RUNTIME,
    target = FILE,
    target = FIELD
  ];

  extensions 1000;  // for features_cpp.proto
  extensions 1001;  // for features_java.proto
}
```

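
As a closing illustration of the two `MessageEncoding` values in the `Features`
message, the Python sketch below frames an already-serialized submessage body
both ways. It relies on the standard wire types named in the
`features.message_encoding` section (2 for length-prefixed, 3 and 4 for group
start and end); the helper names are hypothetical and the sketch is not taken
from any runtime.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number, wire_type):
    """Encode a field tag from its number and wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_submessage(field_number, body, delimited):
    """Frame a serialized submessage as DELIMITED or LENGTH_PREFIXED."""
    if delimited:
        # DELIMITED: start-group tag (3), body, end-group tag (4); no length.
        return encode_tag(field_number, 3) + body + encode_tag(field_number, 4)
    # LENGTH_PREFIXED: wire type 2 tag, byte length, then the body.
    return encode_tag(field_number, 2) + encode_varint(len(body)) + body


# Body of a Bar message with x = 150: tag for field 1 (varint), then 150.
bar_body = encode_tag(1, 0) + encode_varint(150)
length_prefixed = encode_submessage(1, bar_body, delimited=False)
delimited = encode_submessage(1, bar_body, delimited=True)
```

Both framings carry the same body bytes; the delimited form replaces the length
prefix with a closing tag, so a writer does not need to know the body size
before it starts emitting the field.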