# Editions: Feature Extension Layout

**Author:** [@mkruskal-google](https://github.com/mkruskal-google),
[@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-08-23

## Background

"[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan
for allowing for more targeted features not owned by the protobuf team. It uses
extensions of the global features proto to implement this. One thing that was
left a bit ambiguous was *who* should own these extensions. Language, code
generator, and runtime implementations are all similar but not identical
distinctions.

"Editions Zero Feature: utf8_validation" (not available externally, though a
later version,
"[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)"
is) is a recent plan to add a new set of generator features for utf8 validation.
While the sole feature we had originally created (`legacy_closed_enum` in Java
and C++) didn't have any ambiguity here, this one did. Specifically in Python,
the current behaviors across proto2/proto3 are distinct for all 3
implementations: pure python, Python/C++, Python/upb.

## Overview

In meetings, we've discussed various alternatives, captured below. The original
plan was to make feature extensions runtime implementation-specific (e.g. C++,
Java, Python, upb). There are some notable complications that came up though:

1.  **Polyglot** - it's not clear how upb or C++ runtimes should behave in
    multi-language situations. Which feature sets do they consider for runtime
    behaviors? *Note: this is already a serious issue today, where all proto2
    strings and many proto3 strings are completely unsafe across languages.*

2.  **Shared Implementations** - Runtimes like upb and C++ are used as backing
    implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP).
    If we have a single set of `upb` or `cpp` features, migrating to those
    shared implementations would be more difficult (since there are no
    independent switches per-language). *Note: this is already the situation
    we're in today, where switching the runtime implementation can cause subtle
    and dangerous behavior changes.*

Given that we only have two behaviors, and one of them is unambiguous, it seems
reasonable to punt on this decision until we have more information. We may
encounter more edge cases that require feature extensions (and give us more
information) during the rollout of edition zero. We also have a lot of freedom
to re-model features in later editions, so keeping the initial implementation as
simple as possible seems best (i.e. Alternative 2).

## Alternatives

### Alternative 1: Runtime Implementation Features

Features would be per-runtime implementation as originally described in
"Editions Zero Feature: utf8_validation." For example, Protobuf Python users
would set different features depending on the backing implementation (e.g.
`features.(pb.cpp).<feature>`, `features.(pb.upb).<feature>`).

#### Pros

*   Most consistent with the range of behaviors expressible pre-Editions

#### Cons

*   Implementation may / should not be obvious to users.
*   Lack of levers specifically for language / implementation combos. For
    example, there is no way to set Python-C++ behavior independently of C++
    behavior, which may make migration harder from other Python implementations.

### Alternative 2: Generator Features

Features would be per-generator only (i.e. each protoc plugin would own one set
of features). This was the second decision we made in later discussions, and
while very similar to the above alternative, it's more in line with our goal of
making features primarily for codegen.

For example, all Python implementations would share the same set of features
(e.g.
`features.(pb.python).<feature>`). However, certain features could be
targeted to specific implementations (e.g.
`features.(pb.python).upb_utf8_validation` would only be used by Python/upb).

#### Pros

*   Allows independent controls of shared implementations in different target
    languages (e.g. Python's upb feature won't affect PHP).

#### Cons

*   Possible complexity in upb to understand which language's features to
    respect. UPB is not currently aware of what language it is being used for.
*   Limits in-process sharing across languages with shared implementations (e.g.
    Python upb, PHP upb) in the case of conflicting behaviors.
    *   Additional checks may be needed.

### Alternative 3: Migrate to bytes

Since this whole discussion revolves around the utf8 validation feature, one
option would be to just remove it from edition zero. Instead of adding a new
toggle for UTF8 behavior, we could simply migrate everyone who doesn't enforce
utf8 today to `bytes`. This would likely need another new *codegen* feature for
generating byte getters/setters as strings, but that wouldn't have any of the
ambiguity we're seeing today.

Unfortunately, this doesn't seem feasible because of all the different behaviors
laid out in "Editions Zero Feature: utf8_validation." UTF8 validation isn't
really a binary on/off decision, and it can vary widely between languages. There
are many cases where UTF8 is validated in **some** languages but not others, and
there's also the C++ "hint" behavior that logs errors but allows invalid UTF8.

**Note:** This could still be partially done in a follow-up LSC by targeting
specific combinations of the new feature that disable validation in all relevant
languages.

#### Pros

*   Punts on the issue, we wouldn't need any upb features and C++ features would
    all be code-gen only
*   Simplifies the situation, avoids adding a very complicated feature in
    edition zero

#### Cons

*   Not really possible given the current complexity
*   There are O(10M) proto2 string fields that would be blindly changed to bytes

### Alternative 4: Nested Features

Another option is to allow for shared feature set messages. For example, upb
would define a feature message, but *not* make it an extension of the global
`FeatureSet`. Instead, languages with upb implementations would have a field of
this type to allow for finer-grained controls. C++ would both extend the global
`FeatureSet` and also be allowed as a field in other languages.

For example, python utf8 validation could be specified as:

We could have checks during feature validation that enforce that impossible
combinations aren't specified. For example, with our current implementation
`features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`,
since we don't have any mechanism for distinguishing them.

#### Pros

*   Much more explicit than options 1 and 2

#### Cons

*   Maybe too explicit? Proto owners would be forced to duplicate a lot of
    features
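
As a rough illustration of the nested layout described under Alternative 4, the
feature messages might be arranged as in the sketch below. Every message name,
field name, field type, and extension number here is a hypothetical placeholder
chosen for illustration. None of it is taken from the actual feature protos,
the real `utf8_validation` feature may not be a simple boolean, and the sketch
is not expected to coexist with the real `pb.cpp` definitions.

```proto
// Hypothetical sketch of Alternative 4 (nested features). All names, types,
// and numbers are illustrative assumptions, not the real feature protos.
syntax = "proto2";

package pb;

import "google/protobuf/descriptor.proto";

// Shared upb feature message. It is deliberately NOT an extension of the
// global FeatureSet; it is only reachable through a language's own features.
message UpbFeatures {
  optional bool utf8_validation = 1;  // assumed feature, for illustration
}

// C++ feature message. It both extends FeatureSet (see below) and appears as
// a nested field inside other languages' feature messages.
message CppFeatures {
  optional bool utf8_validation = 1;  // assumed feature, for illustration
}

// Python features carry one sub-message per backing implementation.
message PythonFeatures {
  optional CppFeatures cpp = 1;
  optional UpbFeatures upb = 2;
}

extend google.protobuf.FeatureSet {
  optional CppFeatures cpp = 1000;        // placeholder extension number
  optional PythonFeatures python = 1001;  // placeholder extension number
}
```

Under this assumed layout, a Python/upb-specific setting would be spelled along
the lines of `features.(pb.python).upb.utf8_validation`, while
`features.(pb.python).cpp` would have to be kept identical to
`features.(pb.cpp)` by the feature validation check described in Alternative 4.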