# Editions: Group Migration Issues

**Authors**: [@mkruskal-google](https://github.com/mkruskal-google)

## Summary

Address some unexpected issues in delimited encoding in edition 2023 before its
OSS release.

## Background

Joshua Humphries reported some well-timed
[issues](https://github.com/protocolbuffers/protobuf/issues/16239) discovered
while experimenting with our early release of Edition 2023. He discovered that
our new message encoding feature piggybacked a bit too much on the old group
logic, and actually ended up being virtually useless in general.

None of our testing or migrations caught this because they were heavily focused
on *preserving* old behavior (which is the primary goal of edition 2023).
Delimited messages structured exactly like proto2 groups (e.g. message and field
in the same scope with matching names) continued to work exactly as before,
making it seem like everything was fine.

All of this is especially problematic in light of *Submessages: In Pursuit of a
More Perfect Encoding* (not available externally yet), which intends to migrate the
ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature
as a migration tool to eliminate a deprecated syntax is one thing, but trying to
push the ecosystem to it is especially bad.

## Overview

The problems here stem from the fact that before edition 2023, the field and
type names of group fields were guaranteed to always be unique and intuitive.
Proto2 splits groups into a synthetic nested message with a type name equivalent
to the group specification (required to be capitalized), and a field name that's
fully lowercased. For example,

```
optional group MyGroup = 1 { ... }
```

would become:

```
message MyGroup { ... }
optional MyGroup mygroup = 1;
```

The casing here is very important, since the transformation is irreversible.
We can't recover the group name from the field name in general (`mygroup` could
have come from `MyGroup` or `Mygroup`), only if the group is a single word.

The problem under edition 2023 is that we've removed the generation of
synchronized synthetic messages from the language. Users now explicitly define
messages, and any message field can be marked `DELIMITED`. This means that
anyone assuming that the type and field name are synchronized could now be
broken.

### Codegen

While using the field name for generated APIs required less special-casing in
the generators, the field name ends up producing slightly-less-readable APIs for
multi-word camelcased groups. The result is that we see a fairly random-seeming
mix in different generators. Using protoc-explorer (not available externally),
we find the following:

<table>
  <tr>
   <td><strong>Language</strong>
   </td>
   <td><strong>Generated APIs</strong>
   </td>
   <td><strong>Example proto2 getter</strong>
   </td>
  </tr>
  <tr>
   <td>C++
   </td>
   <td>field
   </td>
   <td><code>MyGroup mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Java (all)
   </td>
   <td>message
   </td>
   <td><code>MyGroup getMyGroup()</code>
   </td>
  </tr>
  <tr>
   <td>Python
   </td>
   <td>field
   </td>
   <td><code>mygroup</code>
   </td>
  </tr>
  <tr>
   <td>Go (all)
   </td>
   <td>field
   </td>
   <td><code>GetMygroup() *Foo_MyGroup</code>
   </td>
  </tr>
  <tr>
   <td>Dart V1
   </td>
   <td>field/message*
   </td>
   <td><code>get mygroup</code>
   </td>
  </tr>
  <tr>
   <td>upb **
   </td>
   <td>field
   </td>
   <td><code>Foo_mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Objective-C
   </td>
   <td>message
   </td>
   <td><code>MyGroup* myGroup</code>
   </td>
  </tr>
  <tr>
   <td>Swift
   </td>
   <td>message
   </td>
   <td><code>MyGroup myGroup</code>
   </td>
  </tr>
  <tr>
   <td>C#
   </td>
   <td>field/message*
   </td>
   <td><code>MyGroup Mygroup</code>
   </td>
  </tr>
</table>

\* This codegen difference was [caught](cl/611144002) during the implementation
and intentionally "fixed" in Edition 2023. \
\*\* This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.) \
† Extensions use field

In the Dart V1 implementation, we decided to intentionally introduce a behavior
change on editions upgrades. It was determined that this only affected a handful
of protos in google3, and could probably be manually fixed as-needed. Java's
handling changes the story significantly, since over 50% of protos in google3
produce generated Java code. Objective-C is also noteworthy since we open-source
it, and Swift because it's widely used in OSS and we don't own it.

While the editions upgrade is still non-breaking, it means that the generated
APIs could have very surprising spellings and may not be unique. For example,
using the same type for two delimited fields in the same containing message will
create two sets of generated APIs with the same name in some languages!

### Text Format

Our "official"
[draft specification](https://protobuf.dev/reference/protobuf/textformat-spec/)
of text-format explicitly states that group messages are encoded by the *message
name*, rather than the lowercased field name. A group `MyGroup` will be
serialized as:

```
MyGroup {
  ...
}
```

In C++, we always serialize the message name and have special handling to only
accept the message name in parsing. We also have conformance tests locking down
the positive path here (i.e. using the message name round-trip). The negative
path (i.e. failing to accept the field name) doesn't have a conformance test,
but C++/Java/Python all agree and there's no known case that doesn't.

To make things even stranger, for *extensions* (group fields extending other
messages), we always use the field name for groups.
So as far as group extensions are concerned, there's no problem for editions.

There are a few problems with non-extension group fields in editions:

*   Refactoring the message name will change any text-format output
*   New delimited fields will have unexpected text-format output, that *could*
    conflict with other fields
*   Text parsers will expect the message name, which is surprising and could be
    impossible to specify uniquely

## Recommendation

Clearly the end-state we want is for the field name to be used in all generated
APIs, and for text-format serialization/parsing. The only questions are: how do
we get there and can/should we do it in time for the 2023 release in 27.0 next
month?

We propose a combination of the alternatives listed below.
[Smooth Extension](#smooth-extension) seems like the best short-term path
forward to unblock the delimited migration. It *mostly* solves the problem and
doesn't require any new features. The necessary changes for this approach have
already been prepared, along with new conformance tests to lock down the
behavior changes.

[Global Feature](#global-feature) is a good long-term mitigation for tech debt
we're leaving behind with *Smooth Extension*. Ultimately we would like to remove
any labeling of fields by their type, and editions provides a good mechanism to
do this. Alternatively, we could implement [aliases](#aliases) and use that to
unify this old behavior and avoid a new feature. Either of these options will be
the next step after the release of 2023, with aliases being preferred as long as
the timing works out.

If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available
externally) is the quickest path forward to unblock the release of edition 2023.
It has a lot of downsides though, and will block any migration towards delimited
encoding until edition 2024 has started rolling out.

## Alternatives

### Smooth Extension {#smooth-extension}

Instead of trying to change the existing behavior, we could expand the current
spec to try to cover both proto2 and editions. We would define a "group-like"
concept, which applies to all fields which:

*   Have `DELIMITED` encoding
*   Have a type corresponding to a nested message directly under its containing
    message
*   Have a name corresponding to its lowercased type name.

Note that proto2 groups will *always* be "group-like."

For any group-like field we will use the old proto2 semantics, whatever they are
today. Otherwise, we will treat them as regular fields for both codegen and
text-format. This means that *most* new cases of delimited encoding will have
the desired behavior, while *all* old groups will continue to function. The main
exception here is that users will see the unexpected proto2 behavior if they
have message/field names that *happen* to match.

While the old behavior will result in some unexpected capitalization when it's
hit, it's mostly safe. Because of the second and third conditions above (and the
fact that we disallow duplicate field names), we can guarantee that in both
codegen and text encoding there will never be any conflicting symbols. There can
never be two delimited fields of the same type using the old behavior, and no
other messages or fields will exist with either spelling.

Additionally, we will update the text parsers to accept **both** the old
message-based spelling and the new field-based spelling for group-like fields.
This will at least prevent parsing failures if users hit this unexpected change
in behavior.
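
The group-like rule and its effect on text-format naming can be sketched as
follows. This is a minimal illustration under stated assumptions, not the
actual implementation: `Message`, `Field`, and all three functions are
hypothetical stand-ins rather than real protobuf descriptor APIs, and
extensions (which always use the field name) are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Set


# Hypothetical, simplified stand-ins for descriptors (not the protobuf API).
@dataclass(frozen=True)
class Message:
    name: str                            # e.g. "MyGroup"
    parent: Optional["Message"] = None   # containing message, None if top-level


@dataclass(frozen=True)
class Field:
    name: str                 # e.g. "mygroup"
    message_type: Message     # type of the field
    containing_type: Message  # message the field is declared in
    delimited: bool           # True if the field uses DELIMITED encoding


def is_group_like(field: Field) -> bool:
    """Check the three group-like conditions listed above."""
    return (
        field.delimited
        # The type is nested directly under the field's containing message.
        and field.message_type.parent == field.containing_type
        # The field name is the lowercased type name.
        and field.name == field.message_type.name.lower()
    )


def text_format_name(field: Field) -> str:
    """Name a serializer would emit: legacy message name if group-like."""
    return field.message_type.name if is_group_like(field) else field.name


def accepted_text_format_names(field: Field) -> Set[str]:
    """Names a parser would accept: both spellings for group-like fields."""
    if is_group_like(field):
        return {field.message_type.name, field.name}
    return {field.name}


foo = Message("Foo")
my_group = Message("MyGroup", parent=foo)

# Shaped exactly like a proto2 group: keeps the old message-name spelling.
legacy = Field("mygroup", my_group, foo, delimited=True)
# A new editions-style delimited field: treated as a regular field.
renamed = Field("my_group", my_group, foo, delimited=True)

print(text_format_name(legacy), text_format_name(renamed))
```

Because a field name is unique within its message and the group-like name is
derived from the type name, at most one field per type can take the legacy
path in this model, which is the no-conflict property argued above.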

#### Pros

*   Fully supports old proto2 behavior
*   Treats most new editions fields correctly
*   Doesn't allow for any of the problematic cases we see today
*   By updating the parsers to accept both, we have a migration path to change
    the "wire"-format
*   Decoupled from editions launch (since it's a non-breaking change w/o a
    feature)

#### Cons

*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   Keeps the old proto2 behavior around indefinitely, with no path to remove it
*   Plants a surprising edge case for users if they happen to name their
    message/fields a certain way

### Global Feature {#global-feature}

The simplest answer here is to introduce a new global message feature
`legacy_group_handling` to control all the changes we'd like. This will only be
applicable to group-like fields (see
[Smooth Extension](#smooth-extension)). With this feature enabled,
these fields will always use their message name for text-format. Each
non-conformant language could also use this feature to gate the codegen rules.

#### Pros

*   Simple boolean to gate all the behavior changes
*   Doesn't require adding language features to a bunch of languages that don't
    have them yet
*   Uses editions to ratchet down the bad behavior

#### Cons

*   It's a little late in the game to be introducing new features to 2023
    (go/edition-lifetimes)
*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   The migration story for users is unclear. Overriding the value of this
    feature is both a "wire"-breaking and API-breaking change they may not be
    able to do easily.
*   With the feature set, users will still see all of the problems we have today

### Feature Suite

An extension of [Global Feature](#global-feature) would be to
split the codegen changes out into separate per-language features.

#### Pros

*   Simple booleans to gate all the distinct behavior changes
*   Uses editions to ratchet down the bad behavior
*   Better migration story for users, since it separates API and "wire" breaking
    changes

#### Cons

*   Requires a whole slew of new language features, which typically have a
    difficult first-time setup
*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   Increases the complexity of edition 2023 significantly
*   With the features set, users will still see all of the problems we have
    today

### Nerf Delimited Encoding in 2023

A quick fix to avoid releasing a bad feature would be to simply ban the case
where the message and field names don't match. Adding this validation to protoc
would cover the majority of cases, although we might want additional checks in
every language that supports dynamic messages.

This is a good fallback option if we can't implement anything better before 27.0
is released. It allows us to release editions in a reasonable state, where we
can fix these issues and release a more functional `DELIMITED` feature in 2024.

#### Pros

*   Unblocks editions rollout
*   Easy and safe to implement
*   Avoids rushed implementation of a proper fix
*   Avoids runtime issues with text format
*   Avoids unexpected build breakages post-editions (e.g. renaming the nested
    message)

#### Cons

*   We'd still be releasing a really bad feature. Instead of opening up new
    possibilities, it's just "like groups but worse"
*   We couldn't fix this in 2023 without potential version skew from third-party
    plugins.
    We'd likely have to wait until edition 2024.
*   Might require coordinated changes in a lot of runtimes
*   Doesn't unblock our effort to roll out delimited

### Rename Fields in Editions

While it might be tempting to leverage the edition 2023 upgrade as a place we
can just rename the group field (e.g. rename `mygroup` to `my_group`), that
doesn't actually work. Because so many runtimes already use the *field name*
in generated APIs, they would break under this transformation.

#### Pros

*   Works really well for text-format and some languages

#### Cons

*   Turns 2023 upgrade into a breaking change for many languages

### Aliases {#aliases}

We've discussed aliases a lot, mostly in the context of `Any`, but they would be
useful for any encoding scheme that locks down field/message names. If we had a
fully implemented alias system in place, it would be the perfect mitigation
here. Unfortunately, we don't yet, and the timeline here is probably too tight to
implement one.

#### Pros

*   Fixes all of the problems mentioned above
*   Allows us to specify the old behavior using the proto language, which allows
    it to be handled by Prototiller

#### Cons

*   We want this to be a real fully thought-out feature, not a hack rushed into
    a tight timeline

### Do Nothing

Doing nothing doesn't actually break anyone, but it is embarrassing.

#### Pros

*   Easy to do

#### Cons

*   Releases a horrible feature full of foot-guns in our first edition
*   Doesn't unblock our effort to roll out delimited