# Editions: Feature Extension Layout

**Author:** [@mkruskal-google](https://github.com/mkruskal-google),
[@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-08-23

## Background

"[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan
for allowing more targeted features not owned by the protobuf team. It uses
extensions of the global features proto to implement this. One thing that was
left a bit ambiguous was *who* should own these extensions. Language, code
generator, and runtime implementation are similar but not identical ways of
drawing that boundary.

"Editions Zero Feature: utf8_validation" (not available externally, though a
later version,
"[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)",
is) is a recent plan to add a new set of generator features for utf8 validation.
While the sole feature we had originally created (`legacy_closed_enum` in Java
and C++) didn't have any ambiguity here, this one did. Specifically in Python,
the current behaviors across proto2/proto3 are distinct for all 3
implementations: pure Python, Python/C++, and Python/upb.

## Overview

In meetings, we've discussed various alternatives, captured below. The original
plan was to make feature extensions runtime implementation-specific (e.g. C++,
Java, Python, upb). Some notable complications came up, though:

1.  **Polyglot** - it's not clear how upb or C++ runtimes should behave in
    multi-language situations. Which feature sets do they consider for runtime
    behaviors? *Note: this is already a serious issue today, where all proto2
    strings and many proto3 strings are completely unsafe across languages.*

2.  **Shared Implementations** - Runtimes like upb and C++ are used as backing
    implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP).
    If we have a single set of `upb` or `cpp` features, migrating to those
    shared implementations would be more difficult (since there are no
    independent switches per language). *Note: this is already the situation
    we're in today, where switching the runtime implementation can cause subtle
    and dangerous behavior changes.*

Given that we only have two behaviors, and one of them is unambiguous, it seems
reasonable to punt on this decision until we have more information. We may
encounter more edge cases that require feature extensions (and give us more
information) during the rollout of edition zero. We also have a lot of freedom
to re-model features in later editions, so keeping the initial implementation as
simple as possible seems best (i.e. Alternative 2).

## Alternatives

### Alternative 1: Runtime Implementation Features

Features would be per-runtime implementation as originally described in
"Editions Zero Feature: utf8_validation." For example, Protobuf Python users
would set different features depending on the backing implementation (e.g.
`features.(pb.cpp).<feature>`, `features.(pb.upb).<feature>`).

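As a rough sketch of this layout, a string field consumed from Python might
carry one setting per backing runtime. The `pb.cpp` and `pb.upb` extension names
come from the text above; the `utf8_validation` field on each of them, the
`VERIFY`/`NONE` values, and the imported file paths are assumptions made for
illustration only.

```proto
edition = "2023";

package example;

// Hypothetical imports: files assumed to define the `pb.cpp` and `pb.upb`
// feature extensions, each with a `utf8_validation` field.
import "example/cpp_runtime_features.proto";
import "example/upb_runtime_features.proto";

message Greeting {
  // Each backing runtime is configured separately. Python/C++ would follow the
  // `pb.cpp` setting and Python/upb the `pb.upb` setting, so there is no
  // Python-only switch.
  string text = 1 [
    features.(pb.cpp).utf8_validation = VERIFY,  // assumed feature and value
    features.(pb.upb).utf8_validation = NONE     // assumed feature and value
  ];
}
```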
#### Pros

*   Most consistent with the range of behaviors expressible pre-Editions

#### Cons

*   The backing implementation may not, and arguably should not, be obvious to
    users.
*   Lack of levers specifically for language / implementation combos. For
    example, there is no way to set Python/C++ behavior independently of C++
    behavior, which may make migration from other Python implementations
    harder.

### Alternative 2: Generator Features

Features would be per-generator only (i.e. each protoc plugin would own one set
of features). This was the second decision we made in later discussions, and
while very similar to the above alternative, it's more in line with our goal of
making features primarily for codegen.

For example, all Python implementations would share the same set of features
(e.g. `features.(pb.python).<feature>`). However, certain features could be
targeted to specific implementations (e.g.
`features.(pb.python).upb_utf8_validation` would only be used by Python/upb).

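A minimal sketch of how a field might use such a generator-owned feature. The
option path `features.(pb.python).upb_utf8_validation` is taken from the text
above; the `VERIFY` value and the imported file path are assumptions, since the
source does not define them.

```proto
edition = "2023";

package example;

// Hypothetical import: a file assumed to define the `pb.python` extension.
import "example/python_features.proto";

message Greeting {
  string text = 1 [
    // Owned by the Python generator and consulted only by Python/upb. Other
    // upb-backed languages (e.g. PHP) would not read this setting.
    features.(pb.python).upb_utf8_validation = VERIFY  // assumed value
  ];
}
```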
#### Pros

*   Allows independent controls of shared implementations in different target
    languages (e.g. Python's upb feature won't affect PHP).

#### Cons

*   Possible complexity in upb to understand which language's features to
    respect. upb is not currently aware of which language it is being used for.
*   Limits in-process sharing across languages with shared implementations (e.g.
    Python upb, PHP upb) in the case of conflicting behaviors.
    *   Additional checks may be needed.

### Alternative 3: Migrate to bytes

Since this whole discussion revolves around the utf8 validation feature, one
option would be to just remove it from edition zero. Instead of adding a new
toggle for UTF8 behavior, we could simply migrate everyone who doesn't enforce
utf8 today to `bytes`. This would likely need another new *codegen* feature for
generating byte getters/setters as strings, but that wouldn't have any of the
ambiguity we're seeing today.

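To make the idea concrete, here is a hedged sketch of what a migrated field
might look like. The feature name `bytes_as_string`, its placement on `pb.java`,
and the imported file path are hypothetical; the source only says that some new
codegen feature of this kind would likely be needed. The migration keeps the
serialized form unchanged because `string` and `bytes` share the same
length-delimited wire type.

```proto
edition = "2023";

package example;

// Hypothetical import: a file assumed to define a codegen-only feature named
// `bytes_as_string` on the `pb.java` extension.
import "example/java_codegen_features.proto";

message Greeting {
  // Previously declared as a `string` field with no UTF-8 enforcement. After
  // the migration it is `bytes`, and the assumed codegen feature asks the
  // generator to keep exposing string-typed getters and setters.
  bytes text = 1 [
    features.(pb.java).bytes_as_string = true  // assumed feature name
  ];
}
```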
Unfortunately, this doesn't seem feasible because of all the different behaviors
laid out in "Editions Zero Feature: utf8_validation." UTF8 validation isn't
really a binary on/off decision, and it can vary widely between languages. There
are many cases where UTF8 is validated in **some** languages but not others, and
there's also the C++ "hint" behavior that logs errors but allows invalid UTF8.

**Note:** This could still be partially done in a follow-up LSC by targeting
specific combinations of the new feature that disable validation in all relevant
languages.

#### Pros

*   Punts on the issue: we wouldn't need any upb features, and C++ features
    would all be codegen-only
*   Simplifies the situation and avoids adding a very complicated feature in
    edition zero

#### Cons

*   Not really possible given the current complexity
*   There are O(10M) proto2 string fields that would be blindly changed to bytes

### Alternative 4: Nested Features

Another option is to allow for shared feature set messages. For example, upb
would define a feature message, but *not* make it an extension of the global
`FeatureSet`. Instead, languages with upb implementations would have a field of
this type to allow for finer-grained controls. C++ would both extend the global
`FeatureSet` and also be allowed as a field in other languages.

For example, Python utf8 validation could be specified as:

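The following is a hypothetical sketch of such a specification, not the
authors' original example. It assumes the `pb.python` feature message has `upb`
and `cpp` fields typed as the shared upb and C++ feature messages, and that
those messages expose a `utf8_validation` field with a `VERIFY` value; the
imported file path is likewise an assumption.

```proto
edition = "2023";

package example;

// Hypothetical import: a file assumed to define the `pb.python` extension,
// whose message has `cpp` and `upb` fields typed as the shared feature
// messages.
import "example/python_features.proto";

message Greeting {
  string text = 1 [
    // Nested path: the Python extension, then the shared upb feature message.
    features.(pb.python).upb.utf8_validation = VERIFY,  // assumed names
    // The feature-validation check described in this section would require
    // this setting to match the corresponding `features.(pb.cpp)` value.
    features.(pb.python).cpp.utf8_validation = VERIFY   // assumed names
  ];
}
```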
We could have checks during feature validation that enforce that impossible
combinations aren't specified. For example, with our current implementation,
`features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`,
since we don't have any mechanism for distinguishing them.

#### Pros

*   Much more explicit than Alternatives 1 and 2

#### Cons

*   Maybe too explicit? Proto owners would be forced to duplicate a lot of
    features