# Java Lite For Editions

**Author:** [@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-05-26

## Background

The "Lite" implementation for Java uses a custom format for embedding
descriptors, motivated by critical code-size and performance requirements for
Android.

The code generator for Java Lite encodes a descriptor-like info string, which is
stored in `RawMessageInfo`. This is decoded into `MessageSchema`, which serves
as the descriptor-like schema for Java Lite parsing and serialization.

The current implementation makes significant use of an `is_proto3` bit in the
encoding, which is problematic for editions. Note that any parser changes to the
format would also need to maintain backwards compatibility, due to our
guarantees for parsers to remain backwards compatible within a major version.

## Overview

Fortunately, we already have corresponding bits for most
[Editions Zero Features](edition-zero-features.md) in the `MessageInfo` field
entry encoding.

We will move the remaining syntax usages that read `is_proto3` to use these
bits. Several other syntax usages need to be made editions-compatible by
merging implementations.

As new editions features are added that must be represented in `MessageInfo`, we
will eventually need to revamp the `MessageInfo` encoding to support these
changes. However, this should be avoidable for Editions Zero.

## Recommendation

### Encoding: Add Is Edition Bit

`RawMessageInfo` should be augmented with an additional `is_edition` bit in
the unused bits of `flags`.

\[0]: flags, flags & 0x1 = is proto2?, flags & 0x2 = is message?, flags &
**0x4 = is edition?**

The decoded `ProtoSyntax` should add a corresponding `EDITIONS` option based on
this bit.

```
public enum ProtoSyntax {
  PROTO2,
  PROTO3,
  EDITIONS;
}
```

For now, there is no need to explicitly encode the raw editions string or
feature options. The resolved features will be encoded directly in their
corresponding field entries.

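
The flag layout and the extended `ProtoSyntax` described in this section could be decoded along the following lines. This is only a sketch: the class name `SyntaxFlags`, the constant names, and the method `decodeSyntax` are hypothetical, and the precedence of the edition bit over the proto2 bit is an assumption rather than something the design specifies.

```java
/** Hypothetical sketch: decodes the syntax bits of the RawMessageInfo flags. */
final class SyntaxFlags {
  enum ProtoSyntax { PROTO2, PROTO3, EDITIONS }

  // Bit values taken from the flag layout described in this section.
  static final int IS_PROTO2_BIT = 0x1;
  static final int IS_MESSAGE_BIT = 0x2; // part of the layout; unused for syntax
  static final int IS_EDITION_BIT = 0x4;

  /**
   * Assumption: the edition bit takes precedence over the proto2 bit.
   * Flags produced by older gencode never set 0x4, so they decode as before.
   */
  static ProtoSyntax decodeSyntax(int flags) {
    if ((flags & IS_EDITION_BIT) != 0) {
      return ProtoSyntax.EDITIONS;
    }
    return (flags & IS_PROTO2_BIT) != 0 ? ProtoSyntax.PROTO2 : ProtoSyntax.PROTO3;
  }
}
```

Because existing gencode leaves the new bit unset, a decoder shaped like this would keep returning `PROTO2` or `PROTO3` for it, which is the backwards-compatibility property the Background section asks for.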
### Encoding: Editions Zero Features

Field entries in `RawMessageInfo` already encode bits corresponding to most
***resolved*** Editions Zero features in `GetExperimentalJavaFieldType`. This is
decoded in `fieldTypeWithExtraBits` by reading the corresponding bits.

<table>
  <tr>
   <td><strong>Edition Zero Feature</strong>
   </td>
   <td><strong>Existing Encoding</strong>
   </td>
   <td><strong>Changes</strong>
   </td>
  </tr>
  <tr>
   <td>features.field_presence
   </td>
   <td><code>kHasHasBit (0x1000)</code>
   </td>
   <td>Keep as-is.
   </td>
  </tr>
  <tr>
   <td>java.legacy_closed_enum
   </td>
   <td><code>kMapWithProto2EnumValue (0x800)</code>
   </td>
   <td>Replace with <code>kLegacyEnumIsClosedBit</code>.
<p>
This will now be set for all enum fields, instead of just enum map values.
<p>
We will still need to check syntax in the interim in case of gencode.
   </td>
  </tr>
  <tr>
   <td><em>features.enum_type</em>
   </td>
   <td><em><code>EnumLiteGenerator</code> writes <code>UNRECOGNIZED(-1)</code> value for open enums in gencode.</em>
<p>
<em>This is not encoded in MessageInfo since this is an enum feature.</em>
   </td>
   <td><em>This is not needed in Editions Zero since enum closedness in Java Lite's runtime is dictated per-field by java.legacy_closed_enum (<a href="edition-zero-feature-enum-field-closedness.md">Edition Zero Feature: Enum Field Closedness</a>), but should be used when Java non-conformance is fixed.</em>
<p>
<em>Note, this is implicitly encoded in kLegacyEnumIsClosedBit if java.legacy_closed_enum is unset since the corresponding FieldDescriptor helper should fall back on the EnumDescriptor.</em>
   </td>
  </tr>
  <tr>
   <td>features.repeated_field_encoding
   </td>
   <td><code>GetExperimentalJavaFieldTypeForPacked</code>
   </td>
   <td>Keep as-is.
   </td>
  </tr>
  <tr>
   <td>features.string_field_validation
   </td>
   <td><code>kUtf8CheckBit (0x200)</code>
   </td>
   <td>Keep as-is.
<p>
HINT does not apply to Java and will have the same behavior as MANDATORY or NONE.
   </td>
  </tr>
  <tr>
   <td>features.message_encoding
   </td>
   <td>Not present.
   </td>
   <td>Encode as type group.
<p>
See below.
   </td>
  </tr>
</table>

Several places already use these bits properly, but there are a few syntax
usages in the decoding that should be replaced by checking the corresponding
feature bit.

There are several unused bits that we could use for future field-level features
before breaking the encoding format, but we should not need these for Editions
Zero.

The results of the `is_proto3` and feature bits only seem to be used within
protobuf, and don't seem to be publicly exposed.

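
Replacing a syntax check with a feature-bit check could look roughly like the helpers below. This is a sketch under stated assumptions: the class `FieldEntryBits` and its method and constant names are hypothetical (only `isEnforceUtf8` is mentioned in this document as a possible helper name), the bit values are the ones listed in the table in this section, and `kLegacyEnumIsClosedBit` is assumed to reuse the 0x800 value of the bit it replaces.

```java
/** Hypothetical helpers that read feature bits from a decoded field entry. */
final class FieldEntryBits {
  // Bit values as listed in the feature table of this section.
  static final int UTF8_CHECK_BIT = 0x200;            // kUtf8CheckBit
  // Assumption: kLegacyEnumIsClosedBit reuses the value of kMapWithProto2EnumValue.
  static final int LEGACY_ENUM_IS_CLOSED_BIT = 0x800;
  static final int HAS_HAS_BIT = 0x1000;              // kHasHasBit

  /** True if string values of this field must be validated as UTF-8. */
  static boolean isEnforceUtf8(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & UTF8_CHECK_BIT) != 0;
  }

  /** True if the enum field should be treated as closed. */
  static boolean isLegacyEnumClosed(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & LEGACY_ENUM_IS_CLOSED_BIT) != 0;
  }

  /** True if the field tracks explicit presence with a has-bit. */
  static boolean hasHasBit(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & HAS_HAS_BIT) != 0;
  }
}
```

Call sites that currently ask "is this proto3?" would instead ask the specific question they care about, for example `FieldEntryBits.isEnforceUtf8(entry)`, which also documents which feature drives the behavior.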
#### features.message_encoding

In the compiler, message fields with `features.message_encoding = DELIMITED`
should be treated as a group *before* encoding message info.

This means that `GetExperimentalJavaFieldTypeForSingular` should encode the
field's type as `GROUP` (17) instead of its actual type `MESSAGE` (9), e.g.

```
int GetExperimentalJavaFieldTypeForSingular(const FieldDescriptor* field) {
  int result = field->type();
  if (result == FieldDescriptor::TYPE_MESSAGE) {
    // Delimited message fields are encoded as groups.
    if (field->isDelimited()) {
      return 17;  // GROUP
    }
  }
  // Otherwise, fall through to the existing type mapping, which is unchanged
  // and elided here.
  ...
}
```

`ImmutableMessageFieldLiteGenerator::GenerateFieldInfo` calls this when
generating the message field's field info.

The nested message's `MessageInfo` encoding does not need to be changed as this
is already identical for group and message.

Since each message field will be handled separately, this means that the
post-editions proto file below

```
// foo.proto
edition = "tbd";

message Foo {
  message Bar {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar bar = 1 [features.message_encoding = DELIMITED];
  Bar baz = 2; // not DELIMITED
}
```

will be encoded and treated by `MessageSchema` like its pre-editions equivalent
below.

```
syntax = "proto2";

message Foo {
  optional group Bar = 1 {
    optional int32 x = 1;
    repeated int32 y = 2;
  }
  optional Bar baz = 2; // not DELIMITED
}
```

We recommend this approach to minimize changes to the encoding and to how
groups are treated.

In a future breaking change, we could consider renaming `FieldType.GROUP` to
`FieldType.MESSAGE_DELIMITED` for clarity, while preserving the same number and
encoding. For now, we will leave the naming for this enum as-is.

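
To make concrete what "treated as a group" means on the wire, the sketch below contrasts the two encodings for a `Bar` payload with `x = 1`. It is an illustration only, not protobuf runtime code: the class `DelimitedWireSketch` and its methods are hypothetical, and it assumes field numbers below 16 and payloads shorter than 128 bytes so that tags and lengths each fit in a single byte. The wire type values (2 for length-delimited, 3 for start group, 4 for end group) come from the protobuf wire format.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch of the two wire encodings for a message field. */
final class DelimitedWireSketch {
  static final int WIRETYPE_LENGTH_DELIMITED = 2;
  static final int WIRETYPE_START_GROUP = 3;
  static final int WIRETYPE_END_GROUP = 4;

  /** Builds a single-byte tag; valid only for field numbers 1 through 15. */
  static int tag(int fieldNumber, int wireType) {
    return (fieldNumber << 3) | wireType;
  }

  /** Type MESSAGE: tag, payload length, payload (payload must be < 128 bytes). */
  static byte[] lengthPrefixed(int fieldNumber, byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(tag(fieldNumber, WIRETYPE_LENGTH_DELIMITED));
    out.write(payload.length);
    out.write(payload, 0, payload.length);
    return out.toByteArray();
  }

  /** Type GROUP (DELIMITED): start-group tag, payload, end-group tag. */
  static byte[] delimited(int fieldNumber, byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(tag(fieldNumber, WIRETYPE_START_GROUP));
    out.write(payload, 0, payload.length);
    out.write(tag(fieldNumber, WIRETYPE_END_GROUP));
    return out.toByteArray();
  }
}
```

With the payload bytes `08 01` (field `x` = 1), `delimited(1, payload)` yields `0B 08 01 0C` for `bar`, while `lengthPrefixed(2, payload)` yields `12 02 08 01` for `baz`. The nested payload is byte-for-byte the same in both cases, which is consistent with the nested message's `MessageInfo` not needing to change.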
##### Alternative: Add kIsMessageEncodingDelimitedBit

Alternatively, we could encode `features.message_encoding = DELIMITED` as-is as
type `MESSAGE`. The `MessageInfo` encoding would encode these as a normal
message field, using an unused (0x1100) bit as `kIsMessageEncodingDelimitedBit`.

This could be used to indicate that the message should be parsed from and
serialized to the wire format as if it were a group. This would need to be
passed along to `MessageSchema`, which would then treat messages with this bit
set as groups, e.g. in `case Message`.

This is less ideal, since it would require handling this in multiple places.

### Unify non-feature syntax usages

There are several places that branch on syntax into separate proto2/proto3
code paths. These generally duplicate a lot of code and should be unified into a
single syntax-agnostic code path that branches on the relevant feature bits.

This code tends to be pretty opaque, so we should document this with comments or
add helpers (e.g. `isEnforceUtf8`) to indicate what feature bits are used as we
make changes here.

<table>
  <tr>
   <td><strong>Syntax Usage</strong>
   </td>
   <td><strong>Purpose</strong>
   </td>
   <td><strong>Changes</strong>
   </td>
  </tr>
  <tr>
   <td><code>ManifestSchemaFactory.newSchema()</code>
   </td>
   <td>MessageInfo -> Schema
   </td>
   <td>Allow extensions for editions.
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.getSerializedSize()</code>
   </td>
   <td>Message -> Serialized Size
   </td>
   <td>Unify getSerializedSizeProto2/3
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.writeTo()</code>
   </td>
   <td>Serialize Message
   </td>
   <td>Unify writeFieldsInAscendingOrderProto2/3
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.mergeFrom()</code>
   </td>
   <td>Parse Message
   </td>
   <td>Unify parseProto2/3Message
   </td>
  </tr>
  <tr>
   <td><code>DescriptorMessageInfoFactory.convert()</code>
   </td>
   <td>Descriptor -> MessageInfo
   </td>
   <td>Unify convertProto2/3
   </td>
  </tr>
</table>

There is a lot of dead code in Java Lite, so several syntax usages can also be
deleted or merged where possible.

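
As a rough illustration of what a unified, syntax-agnostic path might look like, the sketch below decides whether a singular `int32` field should be serialized by looking only at the field's has-bit feature, rather than branching into separate proto2 and proto3 methods. The class `UnifiedPresenceCheck` and the method `shouldWriteInt32` are hypothetical and much simpler than the real `MessageSchema` code; the 0x1000 value is the `kHasHasBit` listed earlier in this document.

```java
/** Hypothetical sketch of a syntax-agnostic serialization check. */
final class UnifiedPresenceCheck {
  static final int HAS_HAS_BIT = 0x1000; // kHasHasBit from the field entry encoding

  /**
   * Decides whether a singular int32 field should be serialized.
   * Fields with a has-bit (explicit presence) are written whenever the bit is
   * set; fields without one (implicit presence) are written only when non-zero.
   */
  static boolean shouldWriteInt32(int fieldTypeWithExtraBits, boolean hasBitSet, int value) {
    if ((fieldTypeWithExtraBits & HAS_HAS_BIT) != 0) {
      return hasBitSet;
    }
    return value != 0;
  }
}
```

A single method like this covers what would otherwise be two near-identical code paths, and a file using editions can mix both kinds of fields in one message without the schema needing to know the file's syntax.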
## Alternatives

### Alternative 1: Introduce New Backwards-compatible MessageInfo Encoding

Add a new backwards-compatible `MessageInfo` encoding for editions.

The `is_edition` bit could toggle the encoding format being used, where
`is_edition == true` indicates the new encoding format and `is_edition == false`
indicates the old encoding.

This would allow us to encode additional information that the current encoding
format does not have available bits to support, such as the editions string or
additional features.

For example, the current encoding format only has a fixed number of available
field entry bits where we could encode new feature bits. We will need to
introduce a new encoding format once we exceed these, or if we want to encode
features at the message level.

In a future major version bump when support for proto2/3 is officially dropped,
we could drop support for the previous encoding format.

The recommendation is to revisit alternative 1 along with alternative 2
post-Editions Zero, once we need to support additional feature bits.

#### Pros

*   Future-proof for later editions and features

#### Cons

*   Blocks Editions Zero on more complex encoding changes that won't be used
    yet.
*   Requires more invasive updates to all `MessageInfo` decodings

### Alternative 2: Move to MiniDescriptor encoding

We could switch Java Lite to use the MiniDescriptor encoding specification.

Like Java Lite, this encoding seems to be optimized to be lightweight and to
carry minimal descriptor information.

MiniDescriptors do not encode proto2/proto3 syntax currently, which makes them
mostly editions-compatible. MiniDescriptors encode FieldModifier/MessageModifier
bits that correspond to some Editions Zero features, similarly to the Java Lite
field feature bits, and can be augmented to support additional features.

Supposedly, this encoding format *should* support an arbitrary number of
modifier bits, but this needs to be double-checked to verify there isn't a
similar hard limit to the number of features.

It is unclear whether this is sufficiently optimized for Android's needs and how
compatible this would be with Java Lite's Schemas.

The recommendation is to revisit alternative 2 along with alternative 1
post-Editions Zero, once we need to support additional feature bits.

#### Pros

*   Unify implementations for lower long-term maintenance cost

*   MiniDescriptor encoding will eventually need to be updated for editions
    anyway.

#### Cons

*   Blocks Editions Zero on more complex encoding changes that aren't necessary.

*   Requires even more invasive updates to all `MessageInfo` decodings

*   Probably requires major version bumps to break compatibility

*   Unknown code size / schema compatibility constraints that would need to be
    explored.

*   There are a few possible changes to MiniDescriptors on the table that we
    should wait to settle before bringing on additional implementations.

### Alternative 3: Do Nothing

Doing nothing is always an alternative.

#### Pros

*   No work

#### Cons

*   Editions is blocked since Java Lite protos are stuck in the past
