# Java Lite For Editions

**Author:** [@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-05-26

## Background

The "Lite" implementation for Java utilizes a custom format for embedding
descriptors motivated by critical code-size and performance requirements for
Android.

The code generator for Java Lite encodes a descriptor-like info string which is
stored into `RawMessageInfo`. This is decoded into `MessageSchema` which serves
as the descriptor-like schema for Java Lite for parsing and serialization.

The current implementation makes significant use of an `is_proto3` bit in the
encoding, which is problematic for editions. Note that any parser changes to the
format would also need to maintain backwards compatibility, due to our
guarantees for parsers to remain backwards compatible within a major version.

## Overview

Fortunately, we already have corresponding bits for most
[Editions Zero Features](edition-zero-features.md) in the corresponding
`MessageInfo` field entry encoding.

We will move the remaining syntax usages that read `is_proto3` to use these
bits. Several other syntax usages need to be made editions-compatible by
merging implementations.

As new editions features are added that must be represented in `MessageInfo`, we
will eventually need to revamp `MessageInfo` encoding to support these changes.
However, this should be avoidable for Editions Zero.

## Recommendation

### Encoding: Add Is Edition Bit

`RawMessageInfo` should be augmented with an additional `is_edition` bit in
flags' unused bits.

\[0]: flags, flags & 0x1 = is proto2?, flags & 0x2 = is message?, flags &
**0x4 = is edition?**

The decoded `ProtoSyntax` should add a corresponding Editions option based on
this bit.

```
public enum ProtoSyntax {
  PROTO2,
  PROTO3,
  EDITIONS;
}
```

For now, there is no need to explicitly encode the raw editions string or
feature options.
These resolved features will be encoded directly in their
corresponding field entries.

### Encoding: Editions Zero Features

Field entries in `RawMessageInfo` already encode bits corresponding to most
***resolved*** Editions Zero features in `GetExperimentalJavaFieldType`. This is
decoded in `fieldTypeWithExtraBits` by reading the corresponding bits.

<table>
  <tr>
   <td><strong>Edition Zero Feature</strong>
   </td>
   <td><strong>Existing Encoding</strong>
   </td>
   <td><strong>Changes</strong>
   </td>
  </tr>
  <tr>
   <td>features.field_presence
   </td>
   <td><code>kHasHasBit (0x1000)</code>
   </td>
   <td>Keep as-is.
   </td>
  </tr>
  <tr>
   <td>java.legacy_closed_enum
   </td>
   <td><code>kMapWithProto2EnumValue (0x800)</code>
   </td>
   <td>Replace with <code>kLegacyEnumIsClosedBit</code>.
<p>
This will now be set for all enum fields, instead of just enum map values.
<p>
We will still need to check syntax in the interim in case of gencode.
   </td>
  </tr>
  <tr>
   <td><em>features.enum_type</em>
   </td>
   <td><em><code>EnumLiteGenerator</code> writes <code>UNRECOGNIZED(-1)</code> value for open enums in gencode.</em>
<p>
<em>This is not encoded in MessageInfo since this is an enum feature.</em>
   </td>
   <td><em>This is not needed in Editions Zero since enum closedness in Java Lite's runtime is dictated per-field by java.legacy_closed_enum (<a href="edition-zero-feature-enum-field-closedness.md">Edition Zero Feature: Enum Field Closedness</a>), but should be used when Java non-conformance is fixed.</em>
<p>
<em>Note, this is implicitly encoded in kLegacyEnumIsClosedBit if java.legacy_closed_enum is unset since the corresponding FieldDescriptor helper should fall back on the EnumDescriptor.</em>
   </td>
  </tr>
  <tr>
   <td>features.repeated_field_encoding
   </td>
   <td><code>GetExperimentalJavaFieldTypeForPacked</code>
   </td>
   <td>Keep as-is.
   </td>
  </tr>
  <tr>
   <td>features.string_field_validation
   </td>
   <td><code>kUtf8CheckBit (0x200)</code>
   </td>
   <td>Keep as-is.
<p>
HINT does not apply to Java and will have the same behavior as MANDATORY or NONE.
   </td>
  </tr>
  <tr>
   <td>features.message_encoding
   </td>
   <td>Not present.
   </td>
   <td>Encode as type group.
<p>
See below.
   </td>
  </tr>
</table>

Several places already use these bits properly, but there are a few syntax
usages in the decoding that should be replaced by checking the corresponding
feature bit.

There are several unused bits that we could use for future field-level features
before breaking the encoding format, but we should not need these for Editions
Zero.

The results of the `is_proto3` and feature bits only seem to be used within
protobuf, and don't seem to be publicly exposed.

#### features.message_encoding

In the compiler, message fields with `features.message_encoding = DELIMITED`
should be treated as a group *before* encoding message info.

This means that `GetExperimentalJavaFieldTypeForSingular` should encode the
field's type as `GROUP` (17) instead of its actual type `MESSAGE` (9), e.g.

```
int GetExperimentalJavaFieldTypeForSingular(const FieldDescriptor* field) {
  int result = field->type();
  if (result == FieldDescriptor::TYPE_MESSAGE) {
    // Delimited message fields are encoded as if they were groups.
    if (field->isDelimited()) {
      return 17;  // GROUP
    }
  }
  // ... existing type mapping for all other cases is unchanged ...
}
```

`ImmutableMessageFieldLiteGenerator::GenerateFieldInfo` calls this when
generating the message field's field info.

The nested message's `MessageInfo` encoding does not need to be changed as this
is already identical for group and message.
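
Since the nested `MessageInfo` is the same either way, what distinguishes a
delimited field from a regular message field is only how the field itself is
framed on the wire. The sketch below illustrates that difference using the tag
layout from the protobuf encoding specification (field number shifted left by
three bits, OR'd with the wire type). The class and method names are
hypothetical and are not part of the Java Lite runtime.

```java
// Sketch only: DelimitedFramingSketch and its methods are hypothetical names,
// not protobuf runtime API.
final class DelimitedFramingSketch {
  // Wire types as defined by the protobuf encoding specification.
  static final int WIRETYPE_LENGTH_DELIMITED = 2;
  static final int WIRETYPE_START_GROUP = 3;
  static final int WIRETYPE_END_GROUP = 4;

  // A tag is the field number shifted left by three bits, OR'd with the wire type.
  static int makeTag(int fieldNumber, int wireType) {
    return (fieldNumber << 3) | wireType;
  }

  // Describes how a message-typed field is framed on the wire.
  static String describeFraming(int fieldNumber, boolean delimited) {
    if (delimited) {
      // Group-style framing: payload sits between a start tag and an end tag.
      return String.format("start=0x%02X end=0x%02X",
          makeTag(fieldNumber, WIRETYPE_START_GROUP),
          makeTag(fieldNumber, WIRETYPE_END_GROUP));
    }
    // Regular message framing: one tag followed by a varint byte length.
    return String.format("tag=0x%02X + varint length",
        makeTag(fieldNumber, WIRETYPE_LENGTH_DELIMITED));
  }

  public static void main(String[] args) {
    System.out.println("field 1, delimited: " + describeFraming(1, true));
    System.out.println("field 2, length-prefixed: " + describeFraming(2, false));
  }
}
```

Encoding delimited fields as type `GROUP` lets the schema reuse the existing
start/end-tag handling without a second code path for the same framing.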

Since each message field will be handled separately, this means that the
post-editions proto file below

```
// foo.proto
edition = "tbd";

message Foo {
  message Bar {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar bar = 1 [features.message_encoding = DELIMITED];
  Bar baz = 2;  // not DELIMITED
}
```

will be encoded and treated by `MessageSchema` like its pre-editions equivalent
below.

```
message Foo {
  group Bar = 1 {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar baz = 2;  // not DELIMITED
}
```

We recommend this alternative to minimize changes to the encoding and how
groups are treated.

In a future breaking change, we could consider renaming `FieldType.GROUP` to
`FieldType.MESSAGE_DELIMITED` for clarity, while preserving the same number and
encoding. For now, we will leave the naming for this enum as-is.

##### Alternative: Add kIsMessageEncodingDelimitedBit

Alternatively, we could encode `features.message_encoding = DELIMITED` as-is as
type `MESSAGE`. The `MessageInfo` encoding would encode these as a normal
message field, using an unused (0x1100) bit as `kIsMessageEncodingDelimitedBit`.

This could be used to indicate that the message should be parsed/serialized from
the wire-format as if it were a group. This would need to be passed along to
`MessageSchema` which would then handle treating Messages with this bit set as
groups e.g. in `case Message`.

This is less ideal, since it would require handling this in multiple places.

### Unify non-feature syntax usages

There are several places that branch on syntax into separate proto2/proto3
codepaths. These generally duplicate a lot of code and should be unified into a
single syntax-agnostic code path branching on the relevant feature bits.
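
As a rough illustration of what branching on feature bits instead of syntax
could look like, the sketch below decodes the bits listed in the feature table
from a `fieldTypeWithExtraBits` value and uses them for one presence decision.
The class name, the `TYPE_MASK` value, `shouldSerialize`, and the assumption
that `kLegacyEnumIsClosedBit` reuses the `0x800` slot are all hypothetical and
are not taken from the actual `MessageSchema` code.

```java
// Sketch only: class, method names, and TYPE_MASK are hypothetical.
final class FieldFeatureBitsSketch {
  // Bit values as listed in the feature table.
  static final int UTF8_CHECK_BIT = 0x200;  // kUtf8CheckBit
  // Assumption: kLegacyEnumIsClosedBit reuses the 0x800 slot it replaces.
  static final int LEGACY_ENUM_IS_CLOSED_BIT = 0x800;
  static final int HAS_HAS_BIT = 0x1000;    // kHasHasBit
  // Assumption: the field type occupies the low byte of the entry.
  static final int TYPE_MASK = 0xFF;

  static int fieldType(int fieldTypeWithExtraBits) {
    return fieldTypeWithExtraBits & TYPE_MASK;
  }

  static boolean hasHasBit(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & HAS_HAS_BIT) != 0;
  }

  static boolean isEnforceUtf8(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & UTF8_CHECK_BIT) != 0;
  }

  static boolean isLegacyEnumClosed(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & LEGACY_ENUM_IS_CLOSED_BIT) != 0;
  }

  // One syntax-agnostic decision in place of separate proto2/proto3 branches:
  // explicit-presence fields are written when their has-bit is set,
  // implicit-presence fields when their value is non-default.
  static boolean shouldSerialize(
      int fieldTypeWithExtraBits, boolean hasBitSet, boolean isDefaultValue) {
    return hasHasBit(fieldTypeWithExtraBits) ? hasBitSet : !isDefaultValue;
  }

  public static void main(String[] args) {
    int explicitPresence = 9 | HAS_HAS_BIT;  // MESSAGE (9) with a has-bit
    int implicitPresence = 4;                // hypothetical scalar type id, no has-bit
    System.out.println(shouldSerialize(explicitPresence, true, true));
    System.out.println(shouldSerialize(implicitPresence, false, true));
  }
}
```

Named helpers like these keep each feature check in one place, so the unified
serialization and parsing paths never need to consult `ProtoSyntax` directly.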

This code tends to be pretty opaque, so we should document this with comments or
add helpers (e.g. `isEnforceUtf8`) to indicate what feature bits are used as we
make changes here.

<table>
  <tr>
   <td><code>ManifestSchemaFactory.newSchema()</code>
   </td>
   <td>MessageInfo -> Schema
   </td>
   <td>Allow extensions for editions.
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.getSerializedSize()</code>
   </td>
   <td>Message -> Serialized Size
   </td>
   <td>Unify getSerializedSizeProto2/3
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.writeTo()</code>
   </td>
   <td>Serialize Message
   </td>
   <td>Unify writeFieldsInAscendingOrderProto2/3
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.mergeFrom()</code>
   </td>
   <td>Parse Message
   </td>
   <td>Unify parseProto2/3Message
   </td>
  </tr>
  <tr>
   <td><code>DescriptorMessageInfoFactory.convert()</code>
   </td>
   <td>Descriptor -> MessageInfo
   </td>
   <td>Unify convertProto2/3
   </td>
  </tr>
</table>

There is a lot of dead code in Java Lite, so several syntax usages can also be
deleted or merged where possible.

## Alternatives

### Alternative 1: Introduce New Backwards-compatible MessageInfo Encoding

Add a new backwards-compatible `MessageInfo` encoding for editions.

The `is_edition` bit could toggle the encoding format being used, where
`is_edition == true` indicates the new encoding format and `is_edition == false`
indicates the old encoding.

This would allow us to encode additional information that the current encoding
format lacks the bits to support, such as the editions string or additional
features.

For example, the current encoding format only has a fixed number of available
field entry bits where we could encode new feature bits.
We will need to
introduce a new encoding format once we exceed these, or if we want to encode
features at the message level.

In a future major version bump when support for proto2/3 is officially dropped,
we could drop support for the previous encoding format.

The recommendation is to revisit alternative 1 along with alternative 2
post-Editions Zero as we need to support additional feature bits.

#### Pros

*   Future-proof for future editions and features

#### Cons

*   Blocks Editions Zero on more complex encoding changes that won't be used
    yet.
*   Requires more invasive updates to all MessageInfo decodings

### Alternative 2: Move to MiniDescriptor encoding

We could switch Java Lite to use the MiniDescriptor encoding specification.

Like Java Lite, this encoding seems to be optimized to be lightweight, with
minimal descriptor information.

MiniDescriptors do not encode proto2/proto3 syntax currently, which makes them
mostly editions-compatible. MiniDescriptors encode FieldModifier/MessageModifier
bits that correspond to some Editions Zero features, similar to the Java Lite
field feature bits, and can be augmented to support additional features.

Supposedly, this encoding format *should* support an arbitrary number of
modifier bits, but this needs to be double-checked to verify there isn't a
similar hard limit to the number of features.

It is unclear whether this is sufficiently optimized for Android's needs and how
compatible this would be with Java Lite's Schemas.

The recommendation is to revisit alternative 2 along with alternative 1
post-Editions Zero as we need to support additional feature bits.

#### Pros

*   Unify implementations for lower long-term maintenance cost

*   MiniDescriptor encoding will eventually need to be updated for editions
    anyway.

#### Cons

*   Blocks Editions Zero on more complex encoding changes that aren't necessary.

*   Requires even more invasive updates to all MessageInfo decodings

*   Probably requires major version bumps to break compatibility

*   Unknown code size/schema compatibility constraints that would need to be
    explored.

*   There are a few possible changes to MiniDescriptors on the table that we
    should wait to settle before bringing on additional implementations.

### Alternative 3: Do Nothing

Doing nothing is always an alternative.

#### Pros

*   No work

#### Cons

*   Editions is blocked since Java Lite protos are stuck in the past