1# Stricter Schemas with Editions 2 3**Author:** [@mcy](https://github.com/mcy) 4 5**Approved:** 2022-11-28 6 7## Overview 8 9The Protobuf language is surprisingly lax in what it allows in some places, even 10though these corners of the syntax space are rarely exercised in real use, and 11which add complexity to backends and runtimes. 12 13This document describes several such corners in the language, and how we might 14use Editions to fix them (spoiler: we'll add a feature for each one and then 15ratchet the features). 16 17This is primarily a memo on a use-case for Editions, and not a design doc per 18se. 19 20## Potential Lints 21 22### Entity Names 23 24Protobuf does not enforce any constraints on names other than the "ASCII 25identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This 26results in problems for backends: 27 28* Backends need to be able to convert between PascalCase, camelCase, 29 snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky. 30* Extraneous underscores, such as underscores in names that want to be 31 PascalCase, trailing underscores, leading underscores, and repeated 32 underscores create problems for case conversion and can clash with private 33 names generated by backends. 34* Protobuf does not support non-ASCII identifiers, mostly out of inertia more 35 than anything else. Because some languages (Java most prominent among them) 36 do not support them, we can never support them, but we are not particularly 37 clear on this point. 38 39The Protobuf language should be as strict as possible in what patterns it 40accepts for identifiers, since these need to be transformed to many languages. 41Thus, we propose the following regexes for the three casings used in Protobuf: 42 43* `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for: 44 * Messages. 45 * Enums. 46 * Services. 47 * Methods. 48* `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for: 49 * Fields (including extensions). 50 * Package components. 51* `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for: 52 * Enum values. 53 54These patterns are intended to reject extraneous underscores, and to make casing 55of ASCII letters consistent. We explicitly only support ASCII for maximal 56portability to target languages. Note that option names are not included, since 57those are defined as fields in a proto, and would be subject to this rule 58automatically. 59 60To migrate, we would introduce a bool feature `feature.relax_identifier_rules`, 61which can be applied to any entity. When set, it would cause the compiler to 62reject `.proto` files which contain identifiers that don't match the above 63constraints. It would default to true and would switch to false in a future 64edition. 65 66### Keywords as Identifiers 67 68Currently, the Protobuf language allows using keywords as identifiers. This 69makes the parser somewhat more complicated than it has to be for minimal 70benefit, and shadowing behavior is not well-specified. For example, what does 71the following compile as? 72 73``` 74message Foo { 75 message int32 {} 76 optional int32 foo = 1; 77} 78``` 79 80This is particularly fraught in places where either a keyword or a type name can 81follow. For example, `optional foo = 1;` is a proto3 non-optional with type 82`optional`, but the parser can't tell until it sees the `=` sign. 83 84To avoid this and eventually stop supporting this in the parser, we make the 85following set of keywords true reserved names that cannot be used as 86identifiers: 87 88``` 89bool bytes double edition enum extend extensions fixed32 90fixed64 float group import int32 int64 map max 91message oneof option optional package public repeated required 92reserved returns rpc service sfixed32 sfixed64 sint32 sint64 93stream string syntax to uint32 uint64 weak 94``` 95 96Additionally, we introduce the syntax `#optional` for escaping a keyword as an 97identifier. This may *only* be used on keywords, and not non-keyword 98identifiers. 99 100To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`, 101which can be applied to any entity. When set, it would cause the compiler to 102reject `.proto` files which contain identifiers that use the names of keywords. 103It would migrate true->false in a future edition. The `#optional` syntax would 104not need to be feature-gated. 105 106From time to time we may introduce new keywords. The best procedure for doing so 107is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then 108switch it to false in an edition, which would cause it to be treated as a 109keyword for the purposes of this check. There's nothing stopping us from 110starting to use it in the syntax without an edition if it would be relatively 111unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they 112really hate contextual keywords since it complicates the parser, so keywords 113start out as contextual and become properly reserved in the next Rust edition. 114 115### Nonempty Package 116 117Right now, an empty package is technically permitted. We should remove this 118functionality from the language completely and require every file to declare a 119package. 120 121We would introduce a feature like `feature.allow_missing_package`, start it out 122as true, and switch it to false. 123 124### Invalid Names in `reserved` 125 126Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field 127and thus should be rejected. Ideally we should remove this syntax altogether and 128only permit the use of identifiers in this position, such as `reserved foo, 129bar;`. 130 131We would introduce a feature like `feature.allow_strings_in_reserved`, start it 132out as true, and then switch it to false. 133 134### Almost All Names are Fully Qualified 135 136Right now, Protobuf defines a complicated name resolution scheme that involves 137matching subsets of names inspired by that of C++ (which is even more 138complicated than ours!). Instead, we should require that every name be either a 139single identifier OR fully-qualified. This is an attempt to move to Go-style 140name resolution, which is significantly simpler to implement and explain. 141 142In particular, if a name is a single identifier, then: 143 144* It must be the name of a type defined at the top level of the current file. 145* If it is the name of a message or enum for a field's type, it may be the 146 name of a type defined in the current message. This does *not* apply to 147 extension fields. 148 149Because any multi-component path must be fully qualified, we no longer need the 150`.foo.Bar` syntax anymore, except to refer to messages defined in files without 151a package. We forbid `.`-prefixed names except in that case. 152 153We would introduce a feature like `features.use_cpp_style_name_resolution`, 154start it out as true, and then switch it to false. 155 156Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted 157at a message, rather than a package. In that case, we could go as far as saying 158that "names that start with a lower-case letter are fully-qualified, otherwise 159they are relative to the current package, and will only find things defined in 160the current file." 161 162Unlike Go, we do not allow finding things in other packages without being 163fully-qualified; this mostly comes from doing source-diving in very large 164packages, like the Go runtime, where it is very hard to find where something is 165defined. 166 167### Unique Enum Values 168 169Right now, we allow aliases in enums: 170 171``` 172enum Foo { 173 BAR = 5; 174 BAZ = 5; 175} 176``` 177 178This results in significant complexity in some parts of the backend, and weird 179behavior in textproto and JSON. We should disallow this. 180 181We would introduce a feature like `features.allow_enum_aliases`, which would 182switch from true to false. 183 184### Imports are Used 185 186We should adopt the Go rule that all non-public imports are used (i.e, every 187import provides at least one type referred to in the file). 188 189We would introduce a feature like `features.allow_unused_imports`, which would 190switch from true to false. 191 192### Next Field # is Reserved 193 194There's a few idioms for this checked by linters, such as `// Next ID: N`. We 195should codify this in the language by rewriting that every message begin with 196`reserved N to max;`, with the intent that `N` is the next never-used field 197number. Because it is required to be the first production in the message, it can 198be 199 200We could, additionally, require that *every* field number be either used or 201reserved, in addition to having a single `N to max;` reservation. Alternatively, 202we could require that every field number up to the largest one used be reserved; 203gaps between message numbers are usually a smell. 204 205This applies equally to message fields and enum values. 206 207We would introduce a feature like `features.allow_unused_numbers`, which we 208would switch from true to false. 209 210### Disallow Implicit String Concatenation 211 212Protobuf will implicitly concatenate two adjacent strings in any place it allows 213quoted strings, e.g. `option foo = "bar " "baz;`. This has caused interesting 214problems around `reserved` in the past, if a comma is omitted: `reserved "foo" 215"bar";` is `reserved "foobar";`. 216 217We would introduce a feature like `features.concatenate_adjacent_strings`, which 218would switch from true to false. 219 220### Package Is First 221 222The `package` declaration can appear anywhere in the file after `syntax` or 223`edition`. We should take cues from Go and require it to be the first thing in 224the file, after the edition. 225 226We would introduce a feature like `features.package_anywhere`, which would 227switch from true to false. 228 229### Strict Boolean Options 230 231Boolean options can use true, false, True, False, T, or F as a value: `option 232my_bool = T;`. We should restrict to only `true` and `false`. 233 234We would introduce a feature like `features.loose_bool_options`, which would 235switch from true to false. 236 237### Decimal Field Numbers 238 239We permit non-decimal integer literals for field numbers, e.g. `optional int32 240x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should 241require decimal literals, since there is very little reason to allow other 242literals and makes the Protobuf language harder to parse. 243 244We would introduce a feature like `features.non_decimal_field_numbers`, which 245would switch from true to false.