# Editions: Group Migration Issues

**Authors**: [@mkruskal-google](https://github.com/mkruskal-google)

## Summary

Address some unexpected issues in delimited encoding in edition 2023 before its
OSS release.

## Background

Joshua Humphries reported some well-timed
[issues](https://github.com/protocolbuffers/protobuf/issues/16239) discovered
while experimenting with our early release of Edition 2023. He discovered that
our new message encoding feature piggybacked a bit too much on the old group
logic, and actually ended up being virtually useless in general.

None of our testing or migrations caught this because they were heavily focused
on *preserving* old behavior (which is the primary goal of edition 2023).
Delimited messages structured exactly like proto2 groups (e.g. message and field
in the same scope with matching names) continued to work exactly as before,
making it seem like everything was fine.

All of this is especially problematic in light of *Submessages: In Pursuit of a
More Perfect Encoding* (not available externally yet), which intends to migrate the
ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature
as a migration tool to eliminate a deprecated syntax is one thing, but trying to
push the ecosystem to it is especially bad.

## Overview

The problems here stem from the fact that before edition 2023, the field name
and type name of a group field were guaranteed to always be unique and
intuitive. Proto2 splits groups into a synthetic nested message with a type name
equivalent to the group specification (required to be capitalized), and a field
name that's fully lowercased. For example,

```
optional group MyGroup = 1 { ... }
```

would become:

```
message MyGroup { ... }
optional MyGroup mygroup = 1;
```

The casing here is very important, since the transformation is irreversible. We
can't recover the group name from the field name in general, only if the group
is a single word.

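To make the irreversibility concrete, here is a minimal Python sketch of the
lowercasing step. The helper name `proto2_group_field_name` is hypothetical and
only models the rule described above; it is not part of protoc.

```python
def proto2_group_field_name(group_name: str) -> str:
    """Model of the proto2 rule: the field name is the group name, fully lowercased."""
    return group_name.lower()


# Several distinct capitalizations collapse onto the same field name, so the
# original group name cannot be recovered from the field name alone.
candidates = ["MyGroup", "Mygroup", "MYGroup"]
field_names = {name: proto2_group_field_name(name) for name in candidates}

for group_name, field_name in field_names.items():
    print(f"{group_name} -> {field_name}")
```

Every candidate maps to `mygroup`, which is why the word boundaries of a
multi-word group name are lost once only the field name is available.
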
The problem under edition 2023 is that we've removed the generation of
synchronized synthetic messages from the language. Users now explicitly define
messages, and any message field can be marked `DELIMITED`. This means that
anyone assuming that the type and field name are synchronized could now be
broken.

### Codegen

While using the field name for generated APIs required less special-casing in
the generators, the field name ends up producing slightly-less-readable APIs for
multi-word camelcased groups. The result is that we see a fairly random-seeming
mix in different generators. Using protoc-explorer (not available externally),
we find the following:

<table>
  <tr>
   <td><strong>Language</strong>
   </td>
   <td><strong>Generated APIs</strong>
   </td>
   <td><strong>Example proto2 getter</strong>
   </td>
  </tr>
  <tr>
   <td>C++
   </td>
   <td>field
   </td>
   <td><code>MyGroup mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Java (all)
   </td>
   <td>message
   </td>
   <td><code>MyGroup getMyGroup()</code>
   </td>
  </tr>
  <tr>
   <td>Python
   </td>
   <td>field
   </td>
   <td><code>mygroup</code>
   </td>
  </tr>
  <tr>
   <td>Go (all)
   </td>
   <td>field
   </td>
   <td><code>GetMygroup() *Foo_MyGroup</code>
   </td>
  </tr>
  <tr>
   <td>Dart V1
   </td>
   <td>field/message*
   </td>
   <td><code>get mygroup</code>
   </td>
  </tr>
  <tr>
   <td>upb **
   </td>
   <td>field
   </td>
   <td><code>Foo_mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Objective-c
   </td>
   <td>message
   </td>
   <td><code>MyGroup* myGroup</code>
   </td>
  </tr>
  <tr>
   <td>Swift
   </td>
   <td>message
   </td>
   <td><code>MyGroup myGroup</code>
   </td>
  </tr>
  <tr>
   <td>C#
   </td>
   <td>field/message*
   </td>
   <td><code>MyGroup Mygroup</code>
   </td>
  </tr>
</table>

\* This codegen difference was [caught](cl/611144002) during the implementation
and intentionally "fixed" in Edition 2023. \
\*\* This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.) \
† Extensions use field

In the Dart V1 implementation, we decided to intentionally introduce a behavior
change on editions upgrades. It was determined that this only affected a handful
of protos in google3, and could probably be manually fixed as-needed. Java's
handling changes the story significantly, since over 50% of protos in google3
produce generated Java code. Objective-C is also noteworthy since we open-source
it, and Swift because it's widely used in OSS and we don't own it.

While the editions upgrade is still non-breaking, it means that the generated
APIs could have very surprising spellings and may not be unique. For example,
using the same type for two delimited fields in the same containing message will
create two sets of generated APIs with the same name in some languages!

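The collision described above can be illustrated with a toy model in Python.
This is a sketch under stated assumptions, not real generator code:
`DelimitedField`, `message_based_getter`, `field_based_getter`, and
`find_collisions` are all hypothetical names, and the getter spellings only
imitate the "message" and "field" styles from the table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DelimitedField:
    name: str       # field name as written in the .proto file
    type_name: str  # simple name of the field's message type


def message_based_getter(field: DelimitedField) -> str:
    """Toy 'message'-style naming: the getter is derived from the type name."""
    return "get" + field.type_name


def field_based_getter(field: DelimitedField) -> str:
    """Toy 'field'-style naming: the getter is derived from the field name."""
    return "get" + "".join(part.capitalize() for part in field.name.split("_"))


def find_collisions(fields, namer):
    """Return {generated symbol: [field names]} for symbols produced more than once."""
    by_symbol = defaultdict(list)
    for field in fields:
        by_symbol[namer(field)].append(field.name)
    return {symbol: names for symbol, names in by_symbol.items() if len(names) > 1}


# Two delimited fields in one message that share a message type.
fields = [
    DelimitedField("request_payload", "Payload"),
    DelimitedField("response_payload", "Payload"),
]

print(find_collisions(fields, message_based_getter))
print(find_collisions(fields, field_based_getter))
```

In this model the message-based namer produces `getPayload` for both fields,
while the field-based namer produces two distinct symbols, because field names
are already required to be unique within a message.
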
### Text Format

Our "official"
[draft specification](https://protobuf.dev/reference/protobuf/textformat-spec/)
of text-format explicitly states that group messages are encoded by the *message
name*, rather than the lowercased field name. A group `MyGroup` will be
serialized as:

```
MyGroup {
  ...
}
```

In C++, we always serialize the message name and have special handling to only
accept the message name in parsing. We also have conformance tests locking down
the positive path here (i.e. using the message name round-trip). The negative
path (i.e. failing to accept the field name) doesn't have a conformance test,
but C++/Java/Python all agree and no known implementation disagrees.

To make things even stranger, for *extensions* (group fields extending other
messages), we always use the field name for groups. So as far as group
extensions are concerned, there's no problem for editions.

There are a few problems with non-extension group fields in editions:

*   Refactoring the message name will change any text-format output
*   New delimited fields will have unexpected text-format output that *could*
    conflict with other fields
*   Text parsers will expect the message name, which is surprising and could be
    impossible to specify uniquely

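As a rough illustration of the three problems listed above, the following
Python sketch models the legacy label choice. The function
`legacy_text_format_label` is hypothetical, and the bracketed syntax that
text-format uses for extensions is deliberately left out.

```python
def legacy_text_format_label(field_name: str, message_name: str,
                             is_extension: bool = False) -> str:
    """Model of the legacy rule for delimited (group) fields.

    Non-extension fields are labeled by their message name; extensions are
    labeled by their field name.
    """
    return field_name if is_extension else message_name


# Refactoring the message name changes the serialized label, even though the
# field name stays the same.
before = legacy_text_format_label("mygroup", "MyGroup")
after_rename = legacy_text_format_label("mygroup", "RenamedGroup")

# Two delimited fields that share a type get the same label, so a parser that
# expects the message name cannot tell them apart.
first = legacy_text_format_label("first", "Payload")
second = legacy_text_format_label("second", "Payload")

print(before, after_rename, first, second)
```
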
## Recommendation

Clearly the end-state we want is for the field name to be used in all generated
APIs, and for text-format serialization/parsing. The only questions are: how do
we get there and can/should we do it in time for the 2023 release in 27.0 next
month?

We propose a combination of the alternatives listed below.
[Smooth Extension](#smooth-extension) seems like the best short-term path
forward to unblock the delimited migration. It *mostly* solves the problem and
doesn't require any new features. The necessary changes for this approach have
already been prepared, along with new conformance tests to lock down the
behavior changes.

[Global Feature](#global-feature) is a good long-term mitigation for tech debt
we're leaving behind with *Smooth Extension*. Ultimately we would like to remove
any labeling of fields by their type, and editions provides a good mechanism to
do this. Alternatively, we could implement [aliases](#aliases) and use that to
unify this old behavior and avoid a new feature. Either of these options will be
the next step after the release of 2023, with aliases being preferred as long as
the timing works out.

If we hit any unexpected delays, *Nerf Delimited Encoding in 2023* (described
under Alternatives below) is the quickest path forward to unblock the release of
edition 2023. It has a lot of downsides though, and will block any migration
towards delimited encoding until edition 2024 has started rolling out.

## Alternatives

### Smooth Extension {#smooth-extension}

Instead of trying to change the existing behavior, we could expand the current
spec to try to cover both proto2 and editions. We would define a "group-like"
concept, which applies to all fields which:

*   Have `DELIMITED` encoding
*   Have a type corresponding to a nested message directly under its containing
    message
*   Have a name corresponding to its lowercased type name.

Note that proto2 groups will *always* be "group-like."

For any group-like field we will use the old proto2 semantics, whatever they are
today. Otherwise, we will treat them as regular fields for both codegen and
text-format. This means that *most* new cases of delimited encoding will have
the desired behavior, while *all* old groups will continue to function. The main
exception here is that users will see the unexpected proto2 behavior if they
have message/field names that *happen* to match.

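The three conditions above can be sketched as a single predicate. This is a
minimal Python model under stated assumptions: the `Message` and `Field`
dataclasses are hypothetical stand-ins for real descriptors, and extension
fields are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    name: str               # simple name, e.g. "MyGroup"
    parent: Optional[str]   # full name of the enclosing message, None if top-level


@dataclass(frozen=True)
class Field:
    name: str
    containing_message: str  # full name of the message that declares the field
    delimited: bool          # True if the field uses DELIMITED encoding
    message_type: Message


def is_group_like(field: Field) -> bool:
    """Check the three 'group-like' conditions from the list above."""
    return (
        field.delimited
        # The type is nested directly under the field's containing message.
        and field.message_type.parent == field.containing_message
        # The field name is the lowercased type name.
        and field.name == field.message_type.name.lower()
    )


nested = Message("MyGroup", parent="pkg.Foo")
top_level = Message("MyGroup", parent=None)

proto2_style = Field("mygroup", "pkg.Foo", True, nested)
renamed = Field("my_group", "pkg.Foo", True, nested)
other_scope = Field("mygroup", "pkg.Foo", True, top_level)
length_prefixed = Field("mygroup", "pkg.Foo", False, nested)

for f in (proto2_style, renamed, other_scope, length_prefixed):
    print(f.name, is_group_like(f))
```

Only `proto2_style` satisfies all three conditions. `renamed` shows the common
editions case (a conventionally named field), which would be treated as a
regular field under this proposal.
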
While the old behavior will result in some unexpected capitalization when it's
hit, it's mostly safe. Because of the second and third conditions above (and the
fact that we disallow duplicate field names), we can guarantee that in both
codegen and text encoding there will never be any conflicting symbols. There can
never be two delimited fields of the same type using the old behavior, and no
other messages or fields will exist with either spelling.

Additionally, we will update the text parsers to accept **both** the old
message-based spelling and the new field-based spelling for group-like fields.
This will at least prevent parsing failures if users hit this unexpected change
in behavior.

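A minimal sketch of the relaxed parser lookup described above, in Python. The
names `accepted_text_format_names` and `resolve_field` are hypothetical, and
fields are modeled as plain `(field_name, message_name, group_like)` tuples
rather than real descriptors.

```python
def accepted_text_format_names(field_name, message_name, group_like):
    """Labels a text parser would accept for one message field."""
    if group_like:
        # Group-like fields accept both the old message-based spelling and
        # the new field-based spelling.
        return {message_name, field_name}
    return {field_name}


def resolve_field(label, fields):
    """Return the name of the field that `label` refers to, or None.

    `fields` is a list of (field_name, message_name, group_like) tuples.
    """
    for field_name, message_name, group_like in fields:
        if label in accepted_text_format_names(field_name, message_name, group_like):
            return field_name
    return None


fields = [
    ("mygroup", "MyGroup", True),   # group-like: both spellings parse
    ("payload", "Payload", False),  # regular delimited field: field name only
]

for label in ("MyGroup", "mygroup", "payload", "Payload"):
    print(label, "->", resolve_field(label, fields))
```

Here `MyGroup` and `mygroup` both resolve to the same field, while `Payload` is
rejected because the second field is not group-like.
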
#### Pros

*   Fully supports old proto2 behavior
*   Treats most new editions fields correctly
*   Doesn't allow for any of the problematic cases we see today
*   By updating the parsers to accept both, we have a migration path to change
    the "wire"-format
*   Decoupled from editions launch (since it's a non-breaking change w/o a
    feature)

#### Cons

*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   Keeps the old proto2 behavior around indefinitely, with no path to remove it
*   Plants a surprising edge case for users if they happen to name their
    message/fields a certain way

### Global Feature {#global-feature}

The simplest answer here is to introduce a new global message feature
`legacy_group_handling` to control all the changes we'd like. This will only be
applicable to group-like fields (see
[Smooth Extension](#smooth-extension)). With this feature enabled,
these fields will always use their message name for text-format. Each
non-conformant language could also use this feature to gate the codegen rules.

#### Pros

*   Simple boolean to gate all the behavior changes
*   Doesn't require adding language features to a bunch of languages that don't
    have them yet
*   Uses editions to ratchet down the bad behavior

#### Cons

*   It's a little late in the game to be introducing new features to 2023
    (go/edition-lifetimes)
*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   The migration story for users is unclear. Overriding the value of this
    feature is both a "wire"-breaking and API-breaking change they may not be
    able to do easily.
*   With the feature set, users will still see all of the problems we have today

### Feature Suite

An extension of [Global Feature](#global-feature) would be to
split the codegen changes out into separate per-language features.

#### Pros

*   Simple booleans to gate all the distinct behavior changes
*   Uses editions to ratchet down the bad behavior
*   Better migration story for users, since it separates API and "wire" breaking
    changes

#### Cons

*   Requires a whole slew of new language features, which typically have a
    difficult first-time setup
*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   Increases the complexity of edition 2023 significantly
*   With the features set, users will still see all of the problems we have
    today

### Nerf Delimited Encoding in 2023

A quick fix to avoid releasing a bad feature would be to simply ban the case
where the message and field names don't match. Adding this validation to protoc
would cover the majority of cases, although we might want additional checks in
every language that supports dynamic messages.

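A sketch of what such a validation pass might look like, written in Python for
brevity. It assumes "match" means the field name equals the lowercased message
name, as with proto2 groups; the function name and error wording are
hypothetical and do not reflect an actual protoc check.

```python
def validate_delimited_fields(fields):
    """Return one error string per DELIMITED field whose name doesn't match its type.

    `fields` is an iterable of (field_name, message_name) pairs. A field
    "matches" when its name equals the lowercased message name.
    """
    errors = []
    for field_name, message_name in fields:
        expected = message_name.lower()
        if field_name != expected:
            errors.append(
                f"delimited field '{field_name}' of type '{message_name}' "
                f"must be named '{expected}'"
            )
    return errors


errors = validate_delimited_fields([
    ("mygroup", "MyGroup"),          # allowed: structured like a proto2 group
    ("request_payload", "Payload"),  # banned: names don't match
])

for error in errors:
    print(error)
```
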
This is a good fallback option if we can't implement anything better before 27.0
is released. It allows us to release editions in a reasonable state, where we
can fix these issues and release a more functional `DELIMITED` feature in 2024.

#### Pros

*   Unblocks editions rollout
*   Easy and safe to implement
*   Avoids rushed implementation of a proper fix
*   Avoids runtime issues with text format
*   Avoids unexpected build breakages post-editions (e.g. renaming the nested
    message)

#### Cons

*   We'd still be releasing a really bad feature. Instead of opening up new
    possibilities, it's just "like groups but worse"
*   We couldn't fix this in 2023 without potential version skew from third party
    plugins. We'd likely have to wait until edition 2024
*   Might require coordinated changes in a lot of runtimes
*   Doesn't unblock our effort to roll out delimited

### Rename Fields in Editions

While it might be tempting to leverage the edition 2023 upgrade as a place where
we can just rename the group field (e.g. rename `mygroup` to `my_group`), that
doesn't actually work. Because so many runtimes already use the *field name* in
generated APIs, they would break under this transformation.

#### Pros

*   Works really well for text-format and some languages

#### Cons

*   Turns 2023 upgrade into a breaking change for many languages

### Aliases {#aliases}

We've discussed aliases a lot mostly in the context of `Any`, but they would be
useful for any encoding scheme that locks down field/message names. If we had a
fully implemented alias system in place, it would be the perfect mitigation
here. Unfortunately, we don't yet and the timeline here is probably too tight to
implement one.

#### Pros

*   Fixes all of the problems mentioned above
*   Allows us to specify the old behavior using the proto language, which allows
    it to be handled by Prototiller

#### Cons

*   We want this to be a real fully thought-out feature, not a hack rushed into
    a tight timeline

### Do Nothing

Doing nothing doesn't actually break anyone, but it is embarrassing.

#### Pros

*   Easy to do

#### Cons

*   Releases a horrible feature full of foot-guns in our first edition
*   Doesn't unblock our effort to roll out delimited