# Perfetto CI design document

This CI is used on top of (not as a replacement for) AOSP's TreeHugger.
It gives early testing signals and coverage on other OSes and older Android
devices not supported by TreeHugger.

See the [Testing](/docs/contributing/testing.md) page for more details about the
project testing strategy.

## Architecture diagram

![Architecture diagram](/docs/images/continuous-integration.png)

There are four major components:

1. Frontend: AppEngine.
2. Controller: AppEngine BG service.
3. Workers: Compute Engine + Docker.
4. Database: Firebase realtime database.

They are coupled via the Firebase DB. The DB is the source of truth for the
whole CI.

## Controller

The Controller orchestrates the CI. It's the most trusted piece of the system.

It is based on a background AppEngine service. This service is triggered only
by deferred tasks and periodic Cron jobs.

The Controller is the only entity which performs authenticated access to Gerrit.
It uses a non-privileged gmail account and has no meaningful voting power.

The Controller loop mainly does the following:

- It periodically (every 5s) polls Gerrit for CLs updated in the last 24h.
- It checks the list of CLs against the list of already known CLs in the DB.
- For each new CL it enqueues `N` new jobs in the database, one for each
  configuration defined in [config.py](/infra/ci/config.py) (e.g. `linux-debug`,
  `android-release`, ...).
- It monitors the state of jobs. When all jobs for a CL have been completed,
  it posts a comment and adds the vote if the CL is marked as `Presubmit-Ready`.
- It does some other less-relevant bookkeeping.
- AppEngine is highly reliable and self-healing. If a task fails (e.g. because
  of a Gerrit 500) it will be automatically re-tried with exponential backoff.

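
The "diff against known CLs, then enqueue one job per configuration" step of the
Controller loop can be sketched as below. This is only an illustration, not the
actual Controller code: the function name `plan_new_jobs`, the
`(change_number, patchset)` input shape and the two-entry `JOB_TYPES` list are
assumptions made for the example. The key and job-id formats follow the ones
shown in the "Data model" section.

```python
from datetime import datetime, timezone

# Hypothetical subset of job types; the real list is defined in config.py.
JOB_TYPES = ["linux-debug", "android-release"]


def plan_new_jobs(gerrit_cls, known_cl_keys, now, job_types=JOB_TYPES):
    """Return a {db_path: value} patch that enqueues jobs for unseen CLs.

    gerrit_cls: iterable of (change_number, patchset) tuples from the poll.
    known_cl_keys: set of "<change>-<patchset>" keys already under /cls.
    """
    stamp = now.strftime("%Y%m%d%H%M%S")
    queued = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    patch = {}
    for change, patchset in gerrit_cls:
        cl_key = f"{change}-{patchset}"
        if cl_key in known_cl_keys:
            continue  # Already tracked in the DB, nothing to enqueue.
        jobs = {}
        for job_type in job_types:
            job_id = f"{stamp}--cls-{cl_key}--{job_type}"
            jobs[job_id] = 0
            patch[f"jobs/{job_id}"] = {
                "src": f"cls/{cl_key}",
                "type": job_type,
                "status": "QUEUED",
                "time_queued": queued,
            }
            # Only the key matters in the queue; the value is always 0.
            patch[f"jobs_queued/{job_id}"] = 0
        patch[f"cls/{cl_key}"] = {"time_queued": queued, "jobs": jobs}
        patch[f"cls_pending/{cl_key}"] = 0
    return patch


now = datetime(2019, 7, 8, 15, 32, 42, tzinfo=timezone.utc)
patch = plan_new_jobs([(1000515, 65), (1000515, 64)], {"1000515-64"}, now)
for path in sorted(patch):
    print(path)
```

Returning a flat path-to-value patch keeps the planning step free of I/O, so it
can be tested without a database. How the real Controller batches its writes is
not covered here.
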
## Frontend

The frontend is an AppEngine service that hosts the CI website at
[ci.perfetto.dev](https://ci.perfetto.dev).
Unlike the Controller, it is exposed to the public via HTTP.

- It's an almost fully static website based on HTML and JavaScript.
- The only backend-side code ([frontend.py](/infra/ci/frontend/frontend.py))
  is used to proxy XHR GET requests to Gerrit, due to the lack of Gerrit
  CORS headers.
- Such XHR requests are GET-only and anonymous.
- The frontend Python code also serves as a memcache layer for Gerrit requests
  that return immutable data (e.g. revision logs) to reduce the likelihood of
  hitting Gerrit errors / timeouts.

## Worker GCE VM

The actual testing job happens inside these Google Compute Engine VMs.
The GCE instance is running a CrOS-based
[Container-Optimized](https://cloud.google.com/container-optimized-os/docs/) OS.

The whole system image is read-only. The VM itself is stateless. No state is
persisted outside of the DB and Google Cloud Storage (only for UI artifacts).
The SSD is used only as a scratch disk and is cleared on each reboot.

VMs are dynamically spawned using the Google Cloud Autoscaler and use a
Stackdriver Custom Metric pushed by the Controller as cost function.
This metric is the number of queued + running jobs.

Each VM runs two types of Docker containers: the _worker_ and the _sandbox_.
They are in a 1:1 relationship: each worker controls at most one associated
sandbox. Workers are always alive (they work in polling mode), while
sandboxes are started and stopped by the worker on demand.

On each GCE instance there are M (currently 10) worker containers running and
hence up to M sandboxes.

### Worker containers

Worker containers are trusted entities. They can impersonate the GCE service
account and have R/W access to the DB. They can also spawn sandbox containers.


Their behavior depends only on code that is manually deployed and doesn't depend
on the checkout under test. The reason why workers are Docker containers is NOT
security but only reproducibility and maintenance.

Each worker does the following:

- Poll for an available job from the `/jobs_queued` sub-tree of the DB.
- Move such job into `/jobs_running`.
- Start the sandbox container, passing down the job config and the git revision
  via env vars.
- Stream the sandbox stdout to the `/logs` sub-tree of the DB.
- Terminate the sandbox container prematurely in case of timeouts or job
  cancellations requested by the Controller.
- Upload UI artifacts to GCS.
- Update the DB to reflect completion of jobs, removing the entry from
  `/jobs_running` and updating the `/jobs/$jobId/status` fields.

### Sandbox containers

Sandbox containers are untrusted entities. They can access the internet
(for git pull / install-build-deps) but they cannot impersonate the GCE service
account, cannot write into the DB, and cannot write into GCS buckets.
Docker here is used both as an isolation boundary and for reproducibility /
debugging.

Each sandbox does the following:

- Check out the code at the revision specified in the job config.
- Run one of the [test/ci/](/test/ci/) scripts which will build and run tests.
- Return either a success (0) or fail (!= 0) exit code.

A sandbox container is almost completely stateless with the only exception of
the semi-ephemeral `/ci/cache` mount-point. This mount-point is tmpfs-based
(hence cleared on reboot) but is shared across all sandboxes. It's used only to
maintain the shared ccache.

# Data model

The whole CI is based on
[Firebase Realtime DB](https://firebase.google.com/docs/database).
It is a high-scale JSON object accessible via a simple REST API.
Clients can GET/PUT/PATCH/DELETE individual sub-nodes without having a local
full copy of the DB.

```bash
/ci
    # For post-submit jobs.
    /branches
        /master-20190626000853
        # ┃     ┗━ Committer-date of the HEAD of the branch.
        # ┗━ Branch name
        {
            author: "primiano@google.com"
            rev: "0552edf491886d2bb6265326a28fef0f73025b6b"
            subject: "Cloud-based CI"
            time_committed: "2019-07-06T02:35:14Z"
            jobs:
            {
                20190708153242--branches-master-20190626000853--android-...: 0
                20190708153242--branches-master-20190626000853--linux-...: 0
                ...
            }
        }
        /master-20190701235742 {...}

    # For pre-submit jobs.
    /cls
        /1000515-65
        {
            change_id: "platform%2F...~I575be190"
            time_queued: "2019-07-08T15:32:42Z"
            time_ended: "2019-07-08T15:33:25Z"
            revision_id: "18c2e4d0a96..."
            wants_vote: true
            voted: true
            jobs: {
                20190708153242--cls-1000515-65--android-clang: 0
                ...
                20190708153242--cls-1000515-65--ui-clang: 0
            }
        }
        /1000515-66 {...}
        ...
        /1011130-3 {...}

    /cls_pending
        # Effectively this is an array of pending CLs that we might need to
        # vote on at the end. Only the keys matter, the values have no
        # semantic meaning and are always 0.
        /1000515-65: 0

    /jobs
        /20190708153242--cls-1000515-65--android-clang-arm-debug:
        # ┃               ┃               ┗━ Job type.
        # ┃               ┗━ Path of the CL or branch object.
        # ┗━ Datetime when the job was created.
        {
            src: "cls/1000515-66"
            status: "QUEUED"
                    "STARTED"
                    "COMPLETED"
                    "FAILED"
                    "TIMED_OUT"
                    "CANCELLED"
                    "INTERRUPTED"
            time_ended: "2019-07-07T12:47:22Z"
            time_queued: "2019-07-07T12:34:22Z"
            time_started: "2019-07-07T12:34:25Z"
            type: "android-clang-arm-debug"
            worker: "zqz2-worker-2"
        }
        /20190707123422--cls-1000515-66--android-clang-arm-rel {..}

    /jobs_queued
        # Effectively this is an array. Only the keys matter, the values
        # have no semantic meaning and are always 0.
        /20190708153242--cls-1000515-65--android-clang-arm-debug: 0

    /jobs_running
        # Effectively this is an array. Only the keys matter, the values
        # have no semantic meaning and are always 0.
        /20190707123422--cls-1000515-66--android-clang-arm-rel: 0

    /logs
        /20190707123422--cls-1000515-66--android-clang-arm-rel
            /00a053-0000: "+ chmod 777 /ci/cache /ci/artifacts"
            # ┃      ┗━ Monotonic counter to establish total order on log lines
            # ┃         retrieved within the same read() batch.
            # ┃
            # ┗━ Hex-encoded timestamp, relative to the start of the test.
            /00a053-0001: "+ chown perfetto.perfetto /ci/ramdisk"
            ...

```

# Sequence Diagram

This is what happens, in order, on a worker instance from boot to the test run.

```bash
make -C /infra/ci worker-start
┗━ gcloud start ...

[GCE] # From /infra/ci/worker/gce-startup-script.sh
docker run worker-1 ...
...
docker run worker-N ...

[worker-X] # From /infra/ci/worker/Dockerfile
┗━ /infra/ci/worker/worker.py
  ┗━ docker run sandbox-X ...

[sandbox-X] # From /infra/ci/sandbox/Dockerfile
┗━ /infra/ci/sandbox/init.sh
  ┗━ /infra/ci/sandbox/testrunner.sh
    ┣━ git fetch refs/changes/...
    ┇  ...
    ┇  # This env var is passed by the test definition
    ┇  # specified in /infra/ci/config.py .
    ┗━ $PERFETTO_TEST_SCRIPT
       ┣━ # Which is one of these:
       ┣━ /test/ci/android_tests.sh
       ┣━ /test/ci/fuzzer_tests.sh
       ┣━ /test/ci/linux_tests.sh
       ┗━ /test/ci/ui_tests.sh
          ┣━ ninja ...
          ┗━ out/dist/{unit,integration,...}test
```

### [gce-startup-script.sh](/infra/ci/worker/gce-startup-script.sh)

- Is run once per GCE VM, at (re)boot.
- It prepares the tmpfs mountpoint for the shared ccache.
- It wipes the SSD scratch disk for the build artifacts.
- It pulls the latest {worker, sandbox} container images from
  the Google Cloud Container registry.
- Sets up Docker and `iptables` (for the sandboxed network).
- Starts `N` worker containers in Docker.

### [worker.py](/infra/ci/worker/worker.py)

- It polls the DB to retrieve a job.
- When a job is retrieved, it starts a sandbox container.
- It streams the container stdout/stderr to the DB.
- It uploads the build artifacts to GCS.

### [testrunner.sh](/infra/ci/sandbox/testrunner.sh)

- It is pinned in the container image. Does NOT depend on the particular
  revision being tested.
- Checks out the repo at the revision specified (by the Controller) in the
  job config pulled from the DB.
- Sets up ccache.
- Deals with caching of buildtools/.
- Runs the test script specified in the job config from the checkout.

### [{android,fuzzer,linux,ui}_tests.sh](/test/ci/linux_tests.sh)

- Are NOT pinned in the container and are run from the checked-out revision.
- Finally build and run the tests.

## Playbook

### Frontend (JS/HTML/CSS) changes

Test locally: `make -C infra/ci/frontend test`

Deploy with `make -C infra/ci/frontend deploy`

### Controller changes

Deploy with `make -C infra/ci/controller deploy`

It is possible to try locally via `make -C infra/ci/controller test`,
but this involves:

- Manually stopping the production AppEngine instance via the Cloud Console
  (stopping via the `gcloud` cli doesn't seem to work, b/136828660)
- Downloading the testing service credentials `test-credentials.json`
  (they are in the internal Team drive).

### Worker/Sandbox changes

1. Build and push the new docker containers with:

   `make -C infra/ci build push`

2. Restart the GCE instances, either manually or via

   `make -C infra/ci restart-workers`


## Security considerations

- Both the Firebase DB and the gs://perfetto-artifacts GCS bucket are
  world-readable, and writable by the GAE and GCE service accounts.

- The GAE service account also has the ability to log into Gerrit using a
  dedicated gmail.com account. The GCE service account doesn't.

- Overall, no account in this project has any interesting privilege:
  - The Gerrit account used for commenting on CLs is just a random gmail account
    and has no special voting power.
  - The service accounts of GAE and GCE don't have any special capabilities
    outside of the CI project itself.

- This CI deals only with functional and performance testing and doesn't deal
  with any sort of continuous deployment.

- Presubmit jobs are only triggered if at least one of the following is true:
  - The owner of the CL is a @google.com account.
  - The user that applied the Presubmit-Ready label is a @google.com account.

- Sandboxes are not too hard to escape (Docker is the only boundary) and can
  pollute each other via the shared ccache.

- As such neither pre-submit nor post-submit build artifacts are considered
  trusted. They are only used for establishing functional correctness and
  performance regression testing.

- Binaries built by the CI are not run on any machine outside of the
  CI project. They are deliberately not downloadable.

- The only build artifacts that are retained (for up to 30 days) and uploaded to
  the GCS bucket are the UI artifacts. This is solely for the sake of getting
  visual previews of the HTML changes.

- UI artifacts are served from a different origin (the GCS per-bucket API) than
  the production UI.

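
The presubmit trigger rule listed above (CL owner or `Presubmit-Ready` applier
must be a @google.com account) can be expressed as a small predicate. This is a
minimal sketch for illustration only: the function names `is_trusted` and
`should_trigger_presubmit` are hypothetical, and how the real Controller
extracts the two email addresses from Gerrit is not shown.

```python
def is_trusted(email):
    """True only for addresses whose domain is exactly google.com."""
    return email.lower().endswith("@google.com")


def should_trigger_presubmit(owner_email, label_applier_email=None):
    """Trigger if the CL owner or the Presubmit-Ready applier is trusted.

    label_applier_email is None when nobody has applied the label yet.
    """
    if is_trusted(owner_email):
        return True
    return label_applier_email is not None and is_trusted(label_applier_email)


print(should_trigger_presubmit("dev@google.com"))
print(should_trigger_presubmit("ext@example.com"))
print(should_trigger_presubmit("ext@example.com", "reviewer@google.com"))
```

Matching on the full `@google.com` suffix, rather than on a substring, is what
keeps look-alike domains such as `user@google.com.example.org` from passing.
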