• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# Stricter Schemas with Editions
2
3**Author:** [@mcy](https://github.com/mcy)
4
5**Approved:** 2022-11-28
6
7## Overview
8
9The Protobuf language is surprisingly lax in what it allows in some places, even
10though these corners of the syntax space are rarely exercised in real use, and
11which add complexity to backends and runtimes.
12
13This document describes several such corners in the language, and how we might
14use Editions to fix them (spoiler: we'll add a feature for each one and then
15ratchet the features).
16
17This is primarily a memo on a use-case for Editions, and not a design doc per
18se.
19
20## Potential Lints
21
22### Entity Names
23
24Protobuf does not enforce any constraints on names other than the "ASCII
25identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This
26results in problems for backends:
27
28*   Backends need to be able to convert between PascalCase, camelCase,
29    snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
30*   Extraneous underscores, such as underscores in names that want to be
31    PascalCase, trailing underscores, leading underscores, and repeated
32    underscores create problems for case conversion and can clash with private
33    names generated by backends.
34*   Protobuf does not support non-ASCII identifiers, mostly out of inertia more
35    than anything else. Because some languages (Java most prominent among them)
36    do not support them, we can never support them, but we are not particularly
37    clear on this point.
38
39The Protobuf language should be as strict as possible in what patterns it
40accepts for identifiers, since these need to be transformed to many languages.
41Thus, we propose the following regexes for the three casings used in Protobuf:
42
43*   `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for:
44    *   Messages.
45    *   Enums.
46    *   Services.
47    *   Methods.
48*   `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for:
49    *   Fields (including extensions).
50    *   Package components.
51*   `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for:
52    *   Enum values.
53
54These patterns are intended to reject extraneous underscores, and to make casing
55of ASCII letters consistent. We explicitly only support ASCII for maximal
56portability to target languages. Note that option names are not included, since
57those are defined as fields in a proto, and would be subject to this rule
58automatically.
59
60To migrate, we would introduce a bool feature `feature.relax_identifier_rules`,
61which can be applied to any entity. When set, it would cause the compiler to
62reject `.proto` files which contain identifiers that don't match the above
63constraints. It would default to true and would switch to false in a future
64edition.
65
66### Keywords as Identifiers
67
68Currently, the Protobuf language allows using keywords as identifiers. This
69makes the parser somewhat more complicated than it has to be for minimal
70benefit, and shadowing behavior is not well-specified. For example, what does
71the following compile as?
72
73```
74message Foo {
75  message int32 {}
76  optional int32 foo = 1;
77}
78```
79
80This is particularly fraught in places where either a keyword or a type name can
81follow. For example, `optional foo = 1;` is a proto3 non-optional with type
82`optional`, but the parser can't tell until it sees the `=` sign.
83
84To avoid this and eventually stop supporting this in the parser, we make the
85following set of keywords true reserved names that cannot be used as
86identifiers:
87
88```
89bool      bytes    double  edition   enum      extend    extensions  fixed32
90fixed64   float    group   import    int32     int64     map         max
91message   oneof    option  optional  package   public    repeated    required
92reserved  returns  rpc     service   sfixed32  sfixed64  sint32      sint64
93stream    string   syntax  to        uint32    uint64    weak
94```
95
96Additionally, we introduce the syntax `#optional` for escaping a keyword as an
97identifier. This may *only* be used on keywords, and not non-keyword
98identifiers.
99
100To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`,
101which can be applied to any entity. When set, it would cause the compiler to
102reject `.proto` files which contain identifiers that use the names of keywords.
103It would migrate true->false in a future edition. The `#optional` syntax would
104not need to be feature-gated.
105
106From time to time we may introduce new keywords. The best procedure for doing so
107is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then
108switch it to false in an edition, which would cause it to be treated as a
109keyword for the purposes of this check. There's nothing stopping us from
110starting to use it in the syntax without an edition if it would be relatively
111unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they
112really hate contextual keywords since it complicates the parser, so keywords
113start out as contextual and become properly reserved in the next Rust edition.
114
115### Nonempty Package
116
117Right now, an empty package is technically permitted. We should remove this
118functionality from the language completely and require every file to declare a
119package.
120
121We would introduce a feature like `feature.allow_missing_package`, start it out
122as true, and switch it to false.
123
124### Invalid Names in `reserved`
125
126Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field
127and thus should be rejected. Ideally we should remove this syntax altogether and
128only permit the use of identifiers in this position, such as `reserved foo,
129bar;`.
130
131We would introduce a feature like `feature.allow_strings_in_reserved`, start it
132out as true, and then switch it to false.
133
134### Almost All Names are Fully Qualified
135
136Right now, Protobuf defines a complicated name resolution scheme that involves
137matching subsets of names inspired by that of C++ (which is even more
138complicated than ours!). Instead, we should require that every name be either a
139single identifier OR fully-qualified. This is an attempt to move to Go-style
140name resolution, which is significantly simpler to implement and explain.
141
142In particular, if a name is a single identifier, then:
143
144*   It must be the name of a type defined at the top level of the current file.
145*   If it is the name of a message or enum for a field's type, it may be the
146    name of a type defined in the current message. This does *not* apply to
147    extension fields.
148
149Because any multi-component path must be fully qualified, we no longer need the
150`.foo.Bar` syntax anymore, except to refer to messages defined in files without
151a package. We forbid `.`-prefixed names except in that case.
152
153We would introduce a feature like `features.use_cpp_style_name_resolution`,
154start it out as true, and then switch it to false.
155
156Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted
157at a message, rather than a package. In that case, we could go as far as saying
158that "names that start with a lower-case letter are fully-qualified, otherwise
159they are relative to the current package, and will only find things defined in
160the current file."
161
162Unlike Go, we do not allow finding things in other packages without being
163fully-qualified; this mostly comes from doing source-diving in very large
164packages, like the Go runtime, where it is very hard to find where something is
165defined.
166
167### Unique Enum Values
168
169Right now, we allow aliases in enums:
170
171```
172enum Foo {
173  BAR = 5;
174  BAZ = 5;
175}
176```
177
178This results in significant complexity in some parts of the backend, and weird
179behavior in textproto and JSON. We should disallow this.
180
181We would introduce a feature like `features.allow_enum_aliases`, which would
182switch from true to false.
183
184### Imports are Used
185
186We should adopt the Go rule that all non-public imports are used (i.e, every
187import provides at least one type referred to in the file).
188
189We would introduce a feature like `features.allow_unused_imports`, which would
190switch from true to false.
191
192### Next Field # is Reserved
193
194There's a few idioms for this checked by linters, such as `// Next ID: N`. We
195should codify this in the language by rewriting that every message begin with
196`reserved N to max;`, with the intent that `N` is the next never-used field
197number. Because it is required to be the first production in the message, it can
198be
199
200We could, additionally, require that *every* field number be either used or
201reserved, in addition to having a single `N to max;` reservation. Alternatively,
202we could require that every field number up to the largest one used be reserved;
203gaps between message numbers are usually a smell.
204
205This applies equally to message fields and enum values.
206
207We would introduce a feature like `features.allow_unused_numbers`, which we
208would switch from true to false.
209
210### Disallow Implicit String Concatenation
211
212Protobuf will implicitly concatenate two adjacent strings in any place it allows
213quoted strings, e.g. `option foo = "bar " "baz;`. This has caused interesting
214problems around `reserved` in the past, if a comma is omitted: `reserved "foo"
215"bar";` is `reserved "foobar";`.
216
217We would introduce a feature like `features.concatenate_adjacent_strings`, which
218would switch from true to false.
219
220### Package Is First
221
222The `package` declaration can appear anywhere in the file after `syntax` or
223`edition`. We should take cues from Go and require it to be the first thing in
224the file, after the edition.
225
226We would introduce a feature like `features.package_anywhere`, which would
227switch from true to false.
228
229### Strict Boolean Options
230
231Boolean options can use true, false, True, False, T, or F as a value: `option
232my_bool = T;`. We should restrict to only `true` and `false`.
233
234We would introduce a feature like `features.loose_bool_options`, which would
235switch from true to false.
236
237### Decimal Field Numbers
238
239We permit non-decimal integer literals for field numbers, e.g. `optional int32
240x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should
241require decimal literals, since there is very little reason to allow other
242literals and makes the Protobuf language harder to parse.
243
244We would introduce a feature like `features.non_decimal_field_numbers`, which
245would switch from true to false.