# Chrome OS Update Process

[TOC]

System updates in more modern operating systems like Chrome OS and Android are
called A/B updates, over-the-air ([OTA]) updates, seamless updates, or simply
auto updates. In contrast to more primitive system updates (like Windows or
macOS), where the system is booted into a special mode to overwrite the system
partitions with newer updates and may take several minutes or hours, A/B updates
have several advantages including but not limited to:

* Updates maintain a workable system that remains on the disk during and after
  an update. This reduces the likelihood of corrupting a device into a
  non-usable state, and reduces the need for flashing devices manually or at
  repair and warranty centers.
* Updates can happen while the system is running (normally with minimum
  overhead) without interrupting the user. The only downside for users is a
  required reboot (or, in Chrome OS, a sign out, which automatically causes a
  reboot if an update was performed; the reboot takes about 10 seconds and is
  no different from a normal reboot).
* The user does not need to request an update (although they can). The
  update checks happen periodically in the background.
* If the update fails to apply, the user is not affected. The user will
  continue on the old version of the system and the system will attempt to
  apply the update again at a later time.
* If the update applies correctly but fails to boot, the system will roll back
  to the old partition and the user can still use the system as usual.
* The user does not need to reserve enough space for the update. The system
  has already reserved enough space in terms of two copies (A and B) of a
  partition. The system doesn’t even need any cache space on the disk;
  everything happens seamlessly from network to memory to the inactive
  partitions.

## Life of an A/B Update

In A/B update capable systems, each partition, such as the kernel or root (or
other artifacts like [DLC]), has two copies. We call these two copies active (A)
and inactive (B). The system is booted into the active partition (depending on
which copy has the higher priority at boot time) and when a new update is
available, it is written into the inactive partition. After a successful reboot,
the previously inactive partition becomes active and the old active partition
becomes inactive.

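The priority-based choice between the two copies can be pictured with a small
model. This is only a sketch: the `Slot` class, its fields, and
`pick_boot_slot` are hypothetical names for illustration and do not correspond
to the real bootloader or `update_engine` interfaces.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for one copy (A or B) of a partition set."""
    name: str
    priority: int
    bootable: bool = True


def pick_boot_slot(slots):
    """Return the bootable slot with the highest priority."""
    candidates = [slot for slot in slots if slot.bootable]
    if not candidates:
        raise RuntimeError("no bootable slot")
    return max(candidates, key=lambda slot: slot.priority)


slots = [Slot("A", priority=2), Slot("B", priority=1)]
active = pick_boot_slot(slots)  # A has the higher priority, so it is active
inactive = next(slot for slot in slots if slot is not active)

# After an update is written to the inactive slot, it is given a higher
# priority so that the next boot selects it.
inactive.priority = active.priority + 1
print(pick_boot_slot(slots).name)  # prints B
```
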
### Generation

Everything starts with generating OTA packages on (Google) servers for
each new system image. This is done by calling
[ota_from_target_files](https://cs.android.com/android/platform/superproject/+/master:build/make/tools/releasetools/ota_from_target_files.py)
with source and destination builds. This script requires `target_file.zip` to
work; image files are not sufficient.

### Distribution/Configuration

Once the OTA packages are generated, they are signed with specific keys
and stored in a location known to an update server (GOTA).
GOTA will then make this OTA package accessible via a public URL. Optionally,
operators can choose to make this OTA update available only to a specific
subset of devices.

### Installation

When the device's updater client initiates an update (either periodically or
user initiated), it first consults different device policies to see if the
update check is allowed. For example, device policies can prevent an update
check during certain times of the day, or they can require the update check
time to be scattered randomly throughout the day.

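As a rough illustration of the two example policies above (a blocked time
window and scattering), here is a minimal sketch. The constants and function
names are hypothetical; real device policies are richer and are not modeled
here.

```python
import random
from datetime import datetime, time, timedelta

# Hypothetical policy values chosen only for this sketch.
BLOCKED_START = time(9, 0)
BLOCKED_END = time(17, 0)
MAX_SCATTER = timedelta(hours=2)


def update_check_allowed(now: datetime) -> bool:
    """Disallow update checks inside the blocked daily window."""
    return not (BLOCKED_START <= now.time() < BLOCKED_END)


def scattered_check_time(now: datetime, rng: random.Random) -> datetime:
    """Delay the check by a random amount so devices do not all check at once."""
    delay_seconds = rng.uniform(0, MAX_SCATTER.total_seconds())
    return now + timedelta(seconds=delay_seconds)
```
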
Once policies allow for the update check, the updater client sends a request to
the update server (all this communication happens over HTTPS) and identifies its
parameters like its Application ID, hardware ID, version, board, etc.

Some policies on the server might prevent the device from getting specific
OTA updates. These server-side policies are often set by operators. For
example, the operator might want to deliver a beta version of software to only
a subset of devices.

If the update server decides to serve an update payload, it responds
with all the parameters needed to perform an update, such as the URLs to
download the payloads, the metadata signatures, the payload size and hash, etc.
The updater client continues communicating with the update server after
different state changes, such as reporting that it started to download the
payload, that it finished the update, or that the update failed with specific
error codes.

The device will then proceed to actually installing the OTA update. This
consists of roughly the following steps.

#### Download & Install

Each payload consists of two main sections: metadata and extra data. The
metadata is basically a list of operations that should be performed for an
update. The extra data contains the data blobs needed by some or all of these
operations. The updater client first downloads the metadata and
cryptographically verifies it using the provided signatures from the update
server’s response. Once the metadata is verified as valid, the rest of the
payload can easily be verified cryptographically (mostly through SHA256 hashes).

Next, the updater client marks the inactive partition as unbootable (because it
needs to write the new updates into it). At this point the system cannot
roll back to the inactive partition anymore.

Then, the updater client performs the operations defined in the metadata (in the
order they appear in the metadata) and the rest of the payload is gradually
downloaded when these operations require their data. Once an operation is
finished, its data is discarded. This eliminates the need for caching the entire
payload before applying it. During this process the updater client periodically
checkpoints the last operation performed so that, in the event of a failure or
system shutdown, it can continue from where it left off without redoing all
operations from the beginning.

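The streaming and checkpointing behavior described above can be sketched as
follows. Everything here is illustrative: the operation tuple shape, the
callback names, and the dictionary standing in for persistent storage are
assumptions, and this sketch checkpoints after every operation rather than
periodically.

```python
def apply_operations(operations, fetch_blob, write_blocks, checkpoint):
    """Apply operations in order, resuming after the last checkpointed index.

    operations: list of (offset, length, target_block) tuples (hypothetical).
    fetch_blob(offset, length): downloads only the bytes one operation needs.
    write_blocks(target_block, data): writes to the inactive partition.
    checkpoint: dict standing in for persistent progress storage.
    """
    start = checkpoint.get("next_operation", 0)
    for index in range(start, len(operations)):
        offset, length, target_block = operations[index]
        data = fetch_blob(offset, length)
        write_blocks(target_block, data)
        # The blob is not kept after use; only the progress is persisted.
        checkpoint["next_operation"] = index + 1


payload = b"aaaabbbbcccc"
partition = {}
checkpoint = {"next_operation": 1}  # pretend operation 0 ran before a crash
apply_operations(
    [(0, 4, 0), (4, 4, 1), (8, 4, 2)],
    lambda offset, length: payload[offset:offset + length],
    partition.__setitem__,
    checkpoint,
)
print(sorted(partition))  # prints [1, 2]
```
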
During the download, the updater client hashes the downloaded bytes and when the
download finishes, it checks the payload signature (located at the end of the
payload). If the signature cannot be verified, the update is rejected.

#### Hash Verification & Verity Computation

After the inactive partition is updated, the updater client computes the
Forward-Error-Correction (also known as FEC, Verity) code for each partition
and writes the computed verity data to the inactive partitions. In some updates,
verity data is included in the extra data, so this step is skipped.

Then, the entire partition is re-read, hashed, and compared to a hash value
passed in the metadata to make sure the update was successfully written into
the partition. The hash computed in this step includes the verity code written
in the previous step.

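The re-read-and-compare step can be illustrated with a short sketch that hashes
a partition in fixed-size chunks. It assumes SHA-256 and takes the partition
size and expected hash as plain arguments; the function names and the chunk
size are hypothetical and not taken from `update_engine`.

```python
import hashlib


def partition_sha256(path, size, chunk_size=1024 * 1024):
    """Hash the first `size` bytes of `path` without loading it all in memory."""
    digest = hashlib.sha256()
    remaining = size
    with open(path, "rb") as device:
        while remaining > 0:
            chunk = device.read(min(chunk_size, remaining))
            if not chunk:
                raise IOError("partition is shorter than expected")
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.digest()


def verify_partition(path, size, expected_hash):
    """Return True if the written partition matches the hash from the metadata."""
    return partition_sha256(path, size) == expected_hash
```

Reading in chunks keeps memory use constant regardless of partition size.
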
#### Postinstall

In the next step, the [Postinstall] scripts (if any) are called. From OTA's
perspective, these postinstall scripts are just black boxes. Usually postinstall
scripts optimize existing apps on the phone and run file system garbage
collection, so that the device can boot fast after the OTA. But these are
managed by other teams.

#### Finishing Touches

Then the updater client goes into a state that identifies the update has
completed and the user needs to reboot the system. At this point, until the user
reboots (or signs out), the updater client will not do any more system updates
even if newer updates are available. However, it does continue to perform
periodic update checks so we can have statistics on the number of active devices
in the field.

After the update has proved successful, the inactive partition is marked to have
a higher priority (on a boot, a partition with higher priority is booted
first). Once the user reboots the system, it will boot into the updated
partition and it is marked as active. At this point, after the reboot, the
[update_verifier](https://cs.android.com/android/platform/superproject/+/master:bootable/recovery/update_verifier/)
program runs, reads all dm-verity devices to make sure the partitions aren't
corrupted, then marks the update as successful.

A/B updates are considered completed at this point. Virtual A/B updates have an
additional step after this, called "merging". Merging usually takes a few
minutes; after that, Virtual A/B updates are considered complete.

## Update Engine Daemon

The `update_engine` is a single-threaded daemon process that runs at all
times. This process is the heart of the auto updates. It runs with lower
priorities in the background and is one of the last processes to start after a
system boot. Different clients (like GMS Core or other services) can send
requests for update checks to the update engine. The details of how requests are
passed to the update engine are system dependent. In Chrome OS it is D-Bus; look
at the [D-Bus interface] for a list of all available methods. On Android it is
binder.

There are many resiliency features embedded in the update engine that make auto
updates robust including but not limited to:

* If the update engine crashes, it will restart automatically.
* During an active update it periodically checkpoints the state of the update
  and if it fails to continue the update or crashes in the middle, it will
  continue from the last checkpoint.
* It retries failed network communication.
* If it fails to apply a delta payload (due to bit changes on the active
  partition) a few times, it switches to a full payload.

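Two of the resiliency behaviors listed above (retrying network communication
and falling back from delta to full payloads) might look roughly like this.
The threshold, the function names, and the use of `OSError` to represent a
network failure are assumptions made for the sketch.

```python
MAX_DELTA_ATTEMPTS = 3  # hypothetical; the text only says "a few times"


def choose_payload_type(failed_delta_attempts: int) -> str:
    """Prefer the smaller delta payload until it has failed too often."""
    if failed_delta_attempts >= MAX_DELTA_ATTEMPTS:
        return "full"
    return "delta"


def fetch_with_retries(fetch, max_attempts=3):
    """Call fetch() again after a network error, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch()
        except OSError as error:  # stands in for a network failure
            last_error = error
    raise last_error
```
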
The updater client writes its active preferences in
`/data/misc/update_engine/prefs`. These preferences help with tracking changes
during the lifetime of the updater client and allow properly continuing the
update process after failed attempts or crashes.

### Interactive vs Non-Interactive vs. Forced Updates

Non-interactive updates are updates that are scheduled periodically by the
update engine and happen in the background. Interactive updates, on the other
hand, happen when a user specifically requests an update check (e.g. by clicking
on the “Check For Update” button in Chrome OS’s About page). Depending on the
update server's policies, interactive updates have higher priority than
non-interactive updates (by carrying marker hints). The server may decide not to
provide an update if it is under heavy load, etc. There are other internal
differences between these two types of updates too. For example, interactive
updates try to install the update faster.

Forced updates are similar to interactive updates (initiated by some kind of
user action), but they can also be configured to act as non-interactive. Since
non-interactive updates happen periodically, a forced-non-interactive update
causes a non-interactive update at the moment of the request, not at a later
time. We can call a forced non-interactive update with:

```bash
update_engine_client --interactive=false --check_for_update
```

### Network

The updater client has the capability to download the payloads using Ethernet,
WiFi, or Cellular networks depending on which one the device is connected
to. Downloading over Cellular networks will prompt permission from the user as
it can consume a considerable amount of data.

206
207### Logs
208
209In Chrome OS the `update_engine` logs are located in `/var/log/update_engine`
210directory. Whenever `update_engine` starts, it starts a new log file with the
211current data-time format in the log file’s name
212(`update_engine.log-DATE-TIME`). Many log files can be seen in
213`/var/log/update_engine` after a few restarts of the update engine or after the
214system reboots. The latest active log is symlinked to
215`/var/log/update_engine.log`.
216
217In Android the `update_engine` logs are located in `/data/misc/update_engine_log`.
218
## Update Payload Generation

The update payload generation is the process of converting a set of
partitions/files into a format that is both understandable by the updater client
(especially if it's a much older version) and is securely verifiable. This
process involves breaking the input partitions into smaller components and
compressing them in order to help with network bandwidth when downloading the
payloads.

`delta_generator` is a tool with a wide range of options for generating
different types of update payloads. Its code is located in
`update_engine/payload_generator`. This directory contains all the source code
related to mechanics of generating an update payload. None of the files in this
directory should be included or used in any other library/executable other than
the `delta_generator`, which means this directory does not get compiled into the
rest of the update engine tools.

However, it is not recommended to use `delta_generator` directly, as it has way
too many flags. Wrappers like [ota_from_target_files](https://cs.android.com/android/platform/superproject/+/master:build/make/tools/releasetools/ota_from_target_files.py)
or [OTA Generator](https://github.com/google/ota-generator) should be used.

### Update Payload File Specification

Each update payload file has a specific structure defined in the table below:

| Field                   | Size (bytes) | Type                                 | Description |
| ----------------------- | ------------ | ------------------------------------ | ----------- |
| Magic Number            | 4            | char[4]                              | Magic string "CrAU" identifying this is an update payload. |
| Major Version           | 8            | uint64                               | Payload major version number. |
| Manifest Size           | 8            | uint64                               | Manifest size in bytes. |
| Manifest Signature Size | 4            | uint32                               | Manifest signature blob size in bytes (only in major version 2). |
| Manifest                | Varies       | [DeltaArchiveManifest]               | The list of operations to be performed. |
| Manifest Signature      | Varies       | [Signatures]                         | The signature of the first five fields. There could be multiple signatures if the key has changed. |
| Payload Data            | Varies       | List of raw or compressed data blobs | The list of binary blobs used by operations in the metadata. |
| Payload Signature Size  | Varies       | uint64                               | The size of the payload signature. |
| Payload Signature       | Varies       | [Signatures]                         | The signature of the entire payload except the metadata signature. There could be multiple signatures if the key has changed. |

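To make the fixed-size fields at the start of the table concrete, here is a
minimal parsing sketch. It assumes a major version 2 payload and big-endian
integer encoding; the byte order is not stated in the table, and the function
name and returned dictionary keys are hypothetical.

```python
import struct

# Magic (4 bytes), major version (uint64), manifest size (uint64),
# manifest signature size (uint32). Big-endian is an assumption of this sketch.
HEADER_FORMAT = ">4sQQI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 4 + 8 + 8 + 4 bytes


def parse_payload_header(data: bytes) -> dict:
    """Decode the first four fields of a major version 2 payload."""
    if len(data) < HEADER_SIZE:
        raise ValueError("payload is too short to contain a header")
    magic, major, manifest_size, signature_size = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE])
    if magic != b"CrAU":
        raise ValueError("not an update payload")
    return {
        "major_version": major,
        "manifest_size": manifest_size,
        "manifest_signature_size": signature_size,
        # Payload data follows the manifest and the manifest signature.
        "payload_data_offset": HEADER_SIZE + manifest_size + signature_size,
    }
```

For example, a header with a 100-byte manifest and a 20-byte manifest signature
would place the payload data at offset 144 under these assumptions.
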
### Delta vs. Full Update Payloads

There are two types of payload: Full and Delta. A full payload is generated
solely from the target image (the image we want to update to) and has all the
data necessary to update the inactive partition. Hence, full payloads can be
quite large in size. A delta payload, on the other hand, is a differential
update generated by comparing the source image (the active partitions) and the
target image and producing the diffs between these two images. It is basically a
differential update similar to applications like `diff` or `bsdiff`. Hence,
updating the system using the delta payloads requires the system to read parts
of the active partition in order to update the inactive partition (or
reconstruct the target partition). The delta payloads are significantly smaller
than the full payloads. The structure of the payload is the same for both types.

Payload generation is quite resource intensive and its tools are implemented
with high parallelism.

#### Generating Full Payloads

A full payload is generated by breaking the partition into 2MiB (configurable)
chunks and either compressing them using the bzip2 or XZ algorithms or keeping
them as raw data, depending on which produces the smallest data. Full payloads
are much larger than delta payloads and hence require a longer download time if
the network bandwidth is limited. On the other hand, full payloads are a bit
faster to apply because the system doesn’t need to read data from the source
partition.

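The "pick whichever encoding is smallest" idea can be sketched with the Python
standard library. This is not how `delta_generator` is implemented; the
function names are hypothetical, and the operation labels are borrowed from the
`REPLACE`, `REPLACE_BZ`, and `REPLACE_XZ` operations that full payloads may
contain.

```python
import bz2
import lzma

CHUNK_SIZE = 2 * 1024 * 1024  # 2MiB, the default mentioned above


def smallest_encoding(chunk: bytes):
    """Return (operation_label, blob) for the smallest of raw, bzip2 and XZ."""
    candidates = [
        ("REPLACE", chunk),
        ("REPLACE_BZ", bz2.compress(chunk)),
        ("REPLACE_XZ", lzma.compress(chunk)),
    ]
    return min(candidates, key=lambda item: len(item[1]))


def full_payload_operations(partition: bytes, chunk_size=CHUNK_SIZE):
    """Yield (offset, operation_label, blob) for each chunk of the partition."""
    for offset in range(0, len(partition), chunk_size):
        chunk = partition[offset:offset + chunk_size]
        yield (offset,) + smallest_encoding(chunk)
```

Highly repetitive chunks end up compressed, while chunks that do not compress
well (for example, already compressed data) stay raw.
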
#### Generating Delta Payloads

Delta payloads are generated by looking at both the source and target images'
data on a file and metadata basis (more precisely, the file system level on each
appropriate partition). The reason we can generate delta payloads is that Chrome
OS partitions are read only. So with high certainty we can assume the active
partitions on the client’s device are bit-by-bit equal to the original partitions
generated in the image generation/signing phase. The process for generating a
delta payload is roughly as follows:

1. Find all the zero-filled blocks on the target partition and produce a `ZERO`
   operation for them. The `ZERO` operation basically discards the associated
   blocks (depending on the implementation).
2. Find all the blocks that have not changed between the source and target
   partitions by directly comparing one-to-one source and target blocks and
   produce a `SOURCE_COPY` operation.
3. List all the files (and their associated blocks) in the source and target
   partitions and remove blocks (and files) which we have already generated
   operations for in the last two steps. Assign the remaining metadata (inodes,
   etc) of each partition as a file.
4. If a file is new, generate a `REPLACE`, `REPLACE_XZ`, or `REPLACE_BZ`
   operation for its data blocks depending on which one generates a smaller
   data blob.
5. For each other file, compare the source and target blocks and produce a
   `SOURCE_BSDIFF` or `PUFFDIFF` operation depending on which one generates a
   smaller data blob. These two operations produce binary diffs between a
   source and target data blob. (Look at [bsdiff] and [puffin] for details of
   such binary differential programs!)
6. Sort the operations based on their target partitions’ block offset.
7. Optionally merge same or similar operations next to each other into larger
   operations for better efficiency and potentially smaller payloads.

Full payloads can only contain `REPLACE`, `REPLACE_BZ`, and `REPLACE_XZ`
operations. Delta payloads can contain any operations.

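Steps 1 and 2 of the list above operate purely on blocks and can be sketched
without any file system knowledge. The block size and the function name are
assumptions for illustration; the file-level steps (3 to 5) are not modeled.

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


def classify_blocks(source: bytes, target: bytes, block_size=BLOCK_SIZE):
    """Return (zero, copy, remaining) lists of target block indexes."""
    zero_block = bytes(block_size)
    zero, copy, remaining = [], [], []
    for index in range(len(target) // block_size):
        start = index * block_size
        block = target[start:start + block_size]
        if block == zero_block:
            zero.append(index)        # step 1: candidate for ZERO
        elif source[start:start + block_size] == block:
            copy.append(index)        # step 2: candidate for SOURCE_COPY
        else:
            remaining.append(index)   # left for the file-based steps 3 to 5
    return zero, copy, remaining
```

With a 4-byte block size, a source of `b"AAAABBBBCCCC"` and a target of
`b"\0\0\0\0BBBBDDDD"` classify as zero `[0]`, copy `[1]`, and remaining `[2]`.
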
### Major and Minor versions

The major and minor versions specify the update payload file format and the
capability of the updater client to accept certain types of update payloads
respectively. These numbers are [hard coded] in the updater client.

Major version is basically the update payload file version specified in the
[update payload file specification] above (second field). Each updater client
supports a range of major versions. Currently, there are only two major
versions: 1 and 2. Both Chrome OS and Android are on major version 2 (major
version 1 is being deprecated). Whenever there are new additions that cannot be
fitted in the [Manifest protobuf], we need to uprev the major version. Upreving
the major version should be done with utmost care because older clients do not
know how to handle the newer versions. Any major version uprev in Chrome OS
should be associated with a GoldenEye stepping stone.

Minor version defines the capability of the updater client to accept certain
operations or perform certain actions. Each updater client supports a range of
minor versions. For example, the updater client with minor version 4 (or less)
does not know how to handle a `PUFFDIFF` operation. So when generating a delta
payload for an image which has an updater client with minor version 4 (or less)
we cannot produce a `PUFFDIFF` operation for it. The payload generation process
looks at the source image’s minor version to decide the type of operations it
supports and only generates a payload that conforms to those restrictions.
Similarly, if there is a bug in a client with a specific minor version, an uprev
in the minor version helps avoid generating payloads that cause that bug to
manifest. However, upreving minor versions is quite expensive too in terms of
maintainability and it can be error prone. So one should practice caution when
making such a change.

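The way a generator might restrict operations based on the source image's minor
version can be sketched as a lookup table. Only the `PUFFDIFF` threshold follows
from the text above (minor version 4 or less cannot handle it); every other
number in the table is a placeholder, and the names are hypothetical.

```python
# Minimum minor version assumed to understand each operation.
# Only the PUFFDIFF entry is derived from the text; the rest are placeholders.
MIN_MINOR_VERSION = {
    "REPLACE": 0,
    "REPLACE_BZ": 0,
    "SOURCE_COPY": 2,
    "SOURCE_BSDIFF": 2,
    "PUFFDIFF": 5,
}


def allowed_operations(source_minor_version: int) -> set:
    """Return the operations a client with this minor version can apply."""
    return {
        operation
        for operation, minimum in MIN_MINOR_VERSION.items()
        if source_minor_version >= minimum
    }
```
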
Minor versions are irrelevant in full payloads. Full payloads should always be
able to be applied by very old clients. The reason is that the updater clients
may not send their current version, so if we had different types of full
payloads, we would not know which version to serve to the client.

### Signed vs Unsigned Payloads

Update payloads can be signed (with private/public key pairs) for use in
production or be kept unsigned for use in testing. Tools like `delta_generator`
help with generating metadata and payload hashes or signing the payloads given
private keys.

## update_payload Scripts

[update_payload] contains a set of Python scripts used mostly to validate
payload generation and application. We normally test the update payloads using
an actual device (live tests). The [`brillo_update_payload`] script can be used
to generate a payload and test applying it on a host machine. These tests
can be viewed as dynamic tests without the need for an actual device. Other
`update_payload` scripts (like [`check_update_payload`]) can be used to
statically check that a payload is in the correct state and its application
works correctly. These scripts actually apply the payload statically without
running the code in `payload_consumer`.

## Postinstall

[Postinstall] is a process called after the updater client writes the new image
artifacts to the inactive partitions. One of postinstall's main responsibilities
is to recreate the dm-verity tree hash at the end of the root partition. Among
other things, it installs new firmware updates or runs any board specific
processes. Postinstall runs in a separate chroot inside the newly installed
partition, so it is quite separated from the rest of the active running
system. Anything that needs to be done after an update and before the device is
rebooted should be implemented inside the postinstall.

## Building Update Engine

You can build `update_engine` the same as other platform applications:

### Setup

Run these commands at the top of the Android repository before building anything.
You only need to do this once per shell.

* `source build/envsetup.sh`
* `lunch aosp_cf_x86_64_only_phone-userdebug` (or replace
  `aosp_cf_x86_64_only_phone-userdebug` with your own target)

### Building

`m update_engine update_engine_client delta_generator`

## Running Unit Tests

[Running unit tests similar to other platforms]:

* `atest update_engine_unittests`: You will need a device connected to
  your laptop and accessible via ADB to do this. Cuttlefish works as well.
* `atest update_engine_host_unittests`: Runs a subset of tests on the host; no
  device required.

## Initiating a Configured Update

There are different methods to initiate an update:

* Click on the “Check For Update” button in the settings’ About page. There is
  no way to configure this kind of update check.
* Use the [`scripts/update_device.py`] program and pass a path to your OTA zip file.

## Note to Developers and Maintainers

When changing the update engine source code be extra careful about these things:

### Do NOT Break Backward Compatibility

At each release cycle we should be able to generate full and delta payloads that
can correctly be applied to older devices that run older versions of the update
engine client. So for example, removing or not passing arguments in the metadata
proto file might break older clients. Or passing operations that are not
understood in older clients will break them. Whenever changing anything in the
payload generation process, ask yourself this question: Would it work on older
clients? If not, do I need to control it with minor versions or any other means?

Especially regarding enterprise rollback, a newer updater client should be able
to accept an older update payload. Normally this happens using a full payload,
but care should be taken in order to not break this compatibility.

### Think About The Future

When creating a change in the update engine, think about five years from now:

* How can the change be implemented so that five years from now older clients
  don’t break?
* How is it going to be maintained five years from now?
* How can it make it easier for future changes without breaking older clients
  or incurring heavy maintenance costs?

### Prefer Not To Implement Your Feature In The Updater Client

If a feature can be implemented from the server side, do NOT implement it in the
updater client, because the updater client can be fragile at points and small
mistakes can have catastrophic consequences. For example, if a bug is introduced
in the updater client that causes it to crash right before checking for an
update and we can't quite catch this bug early in the release process, then the
production devices which have already moved to the new buggy system may no
longer receive automatic updates. So, always ask whether the feature being
implemented can be done from the server side (with potentially minimal
changes to the updater client), or whether it can be moved to another service
with a minimal interface to the updater client. Answering these questions will
pay off greatly in the future.

### Be Respectful Of Other Code Bases

~~The current update engine code base is used in many projects like Android.~~

The Android and ChromeOS codebases have officially diverged.

We sync the code base among these two projects frequently. Try to not break Android
or other systems that share the update engine code. Whenever landing a change,
always think about whether Android needs that change:

* How will it affect Android?
* Can the change be moved to an interface and stub implementations be
  implemented so as not to affect Android?
* Can Chrome OS or Android specific code be guarded by macros?

As a basic measure, if adding/removing/renaming code, make sure to change both
`build.gn` and `Android.bp`. Do not bring Chrome OS specific code (for example
other libraries that live in `system_api` or `dlcservice`) into the common code
of update_engine. Try to separate these concerns using best software engineering
practices.

### Merging from Android (or other code bases)

Chrome OS tracks the Android code as an [upstream branch]. To merge the Android
code to Chrome OS (or vice versa) just do a `git merge` of that branch into
Chrome OS, test it using whatever means and upload a merge commit.

```bash
repo start merge-aosp
git merge --no-ff --strategy=recursive -X patience cros/upstream
repo upload --cbr --no-verify .
```

[Postinstall]: #postinstall
[update payload file specification]: #update-payload-file-specification
[OTA]: https://source.android.com/devices/tech/ota
[DLC]: https://chromium.googlesource.com/chromiumos/platform2/+/master/dlcservice
[`chromeos-setgoodkernel`]: https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-setgoodkernel
[D-Bus interface]: /dbus_bindings/org.chromium.UpdateEngineInterface.dbus-xml
[this repository]: /
[UpdateManager]: /update_manager/update_manager.cc
[update_manager]: /update_manager/
[P2P update related code]: https://chromium.googlesource.com/chromiumos/platform2/+/master/p2p/
[`cros_generate_update_payloads`]: https://chromium.googlesource.com/chromiumos/chromite/+/master/scripts/cros_generate_update_payload.py
[`chromite/lib/paygen`]: https://chromium.googlesource.com/chromiumos/chromite/+/master/lib/paygen/
[DeltaArchiveManifest]: /update_metadata.proto#302
[Signatures]: /update_metadata.proto#122
[hard coded]: /update_engine.conf
[Manifest protobuf]: /update_metadata.proto
[update_payload]: /scripts/
[Postinstall]: https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-postinst
[`update_engine` protobufs]: https://chromium.googlesource.com/chromiumos/platform2/+/master/system_api/dbus/update_engine/
[Running unit tests similar to other platforms]: https://chromium.googlesource.com/chromiumos/docs/+/master/testing/running_unit_tests.md
[Nebraska]: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/master/nebraska/
[upstream branch]: https://chromium.googlesource.com/aosp/platform/system/update_engine/+/upstream
[`cros flash`]: https://chromium.googlesource.com/chromiumos/docs/+/master/cros_flash.md
[bsdiff]: https://android.googlesource.com/platform/external/bsdiff/+/master
[puffin]: https://android.googlesource.com/platform/external/puffin/+/master
[`update_engine_client`]: /update_engine_client.cc
[`brillo_update_payload`]: /scripts/brillo_update_payload
[`check_update_payload`]: /scripts/paycheck.py
[Dev Server]: https://chromium.googlesource.com/chromiumos/chromite/+/master/docs/devserver.md