Infra Trooper Documentation
===========================

### Contents ###

* [What does an Infra trooper do?](#what_is_a_trooper)
* [View current and upcoming troopers](#view_current_upcoming_troopers)
* [How to swap trooper shifts](#how_to_swap)
* [Tips for troopers](#tips)


<a name="what_is_a_trooper"></a>
What does an Infra trooper do?
------------------------------

The trooper has two main jobs:

1) Keep an eye on Infra alerts emails (sent to infra-alerts@skia.org). The alerts are also available [here](https://alerts.skia.org/infra).

2) Resolve the above alerts as they come in.

<a name="view_current_upcoming_troopers"></a>
View current and upcoming troopers
----------------------------------

The list of troopers is specified in the [skia-tree-status web app](http://skia-tree-status.appspot.com/trooper). The current trooper is highlighted in green.
The banner on the top of the [status page](https://status.skia.org) also displays the current trooper.


<a name="how_to_swap"></a>
How to swap trooper shifts
--------------------------

If you need to swap shifts with someone (because you are out sick or on vacation), please get approval from the person you want to swap with. Then send an email to skiabot@google.com and cc rmistry@.


<a name="tips"></a>
Tips for troopers
-----------------

- Make sure you are a member of
  [MDB group chrome-skia-ninja](https://ganpati.corp.google.com/#Group_Info?name=chrome-skia-ninja@prod.google.com).
  Valentine passwords and Chrome Golo access are based on membership in this
  group.

- These alerts generally auto-dismiss once the criteria for the alert are no
  longer met:
    - Monitoring alerts, including prober, collectd, and others
    - Disconnected build slaves

- These alerts generally do not auto-dismiss ([issue here](https://bug.skia.org/4292)):
    - Build slaves that failed a step
    - Disconnected devices (these are detected as the "wait for device" step failing)

- "Failed to execute query" may show a different query than the failing one;
  dismiss the alert to get a new alert showing the query that is actually
  failing. (All "failed to execute query" alerts are lumped into a single alert,
  so the query that initially triggered the alert may no longer be failing
  while the alert stays active because another query is failing.)

- Where machines are located:
    - Machine name like "skia-vm-NNN", "ct-vm-NNN" -> GCE
    - Machine name ends with "a3", "a4", "m3" -> Chrome Golo
    - Machine name ends with "m5" -> CT bare-metal bots in Chrome Golo
    - Machine name starts with "skiabot-" -> Chapel Hill lab
    - Machine name starts with "win8" -> Chapel Hill lab (Windows machine
      names can't be very long, so the "skiabot-shuttle-" prefix is dropped.)
    - slave11-c3 is a Chrome infra GCE machine (not to be confused with the Skia
      Buildbots GCE, which we refer to as simply "GCE")

- The [chrome-infra hangout](https://goto.google.com/cit-hangout) is useful for
  questions regarding bots managed by the Chrome Infra team and to get
  visibility into upstream failures that cause problems for us.

- To log in to a Linux buildbot in GCE, use
  `gcloud compute ssh default@<machine name>`. Choose the zone listed for the
  [GCE VM](https://pantheon.corp.google.com/project/31977622648/compute/instances)
  (or specify it using the `--zone` command-line flag).

- To log in to a Windows buildbot in GCE, use
  [Chrome RDP Extension](https://chrome.google.com/webstore/detail/chrome-rdp/cbkkbcmdlboombapidmoeolnmdacpkch?hl=en-US)
  with the
  [IP address of the GCE VM](https://pantheon.corp.google.com/project/31977622648/compute/instances)
  shown on the [host info page](https://status.skia.org/hosts) for that bot. The
  username is chrome-bot and the password can be found on
  [Valentine](https://valentine.corp.google.com/) as "chrome-bot (Win GCE)".

- If there is a problem with a bot in the Chrome Golo or Chrome infra GCE, the
  best course of action is to
  [file a bug](https://code.google.com/p/chromium/issues/entry?template=Build%20Infrastructure)
  with the Chrome infra team. But if you know what you're doing:
    - To access bots in the Chrome Golo,
      [follow these instructions](https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ssh.md).
        - Machine name ends with "a3" or "a4" -> ssh command looks like
          `ssh build3-a3.chrome`
        - Machine name ends with "m3" -> ssh command looks like `ssh build5-m3.golo`
        - Machine name ends with "m5" -> ssh command looks like `ssh build1-m5.golo`.
          [Example bug](https://bugs.chromium.org/p/chromium/issues/detail?id=638193) to file to Infra Labs.
        - For MacOS and Windows bots, you will be prompted for a password, which is
          stored on [Valentine](https://valentine.corp.google.com/) as "Chrome Golo,
          Perf, GPU bots - chrome-bot".
    - To access bots in the Chrome infra GCE -> command looks like
      `gcutil --project=google.com:chromecompute ssh --ssh_user=default slave11-c3`
      (or use the ccompute ssh script from the infra_internal repo).

- Read over the [SkiaLab documentation](../testing/skialab) for more detail on
  dealing with device alerts.

- To stop a buildslave for a device, log in to the host for that device, `cd
  ~/buildbot/<slave name>/build/slave; make stop`.
  To start it again,
  `TESTING_SLAVENAME=<slave name> make start`.

- Buildslaves can be slow to come up after reboot, but if the buildslave remains
  disconnected, you may need to start it manually. On Mac and Linux, check using
  `ps aux | grep python` that neither buildbot nor gclient are running, then run
  `~/skiabot-slave-start-on-boot.sh`.

- Sometimes iOS builds fail with 'The service is invalid'. Try rebooting the iOS host to fix this.
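
The machine-naming rules and login commands in the tips above can be collected into a small lookup helper. The following is only a sketch and not part of any Skia tooling: the function names `locate` and `ssh_command` are hypothetical, "NNN" is assumed to mean one or more digits, and the order in which overlapping rules are checked (exact name, then prefixes, then suffixes) is an assumption. It cannot tell Linux from Windows GCE bots, so the `gcloud` command it returns applies only to Linux buildbots; Windows GCE bots are reached over RDP as described above.

```python
import re
from typing import Optional


def locate(machine: str) -> str:
    """Guess where a machine lives from its name, following the tips above."""
    if machine == "slave11-c3":
        return "Chrome infra GCE"
    # Assumption: "NNN" in "skia-vm-NNN" / "ct-vm-NNN" is a run of digits.
    if re.fullmatch(r"(skia|ct)-vm-\d+", machine):
        return "GCE"
    if machine.startswith(("skiabot-", "win8")):
        return "Chapel Hill lab"
    if machine.endswith("m5"):
        return "Chrome Golo (CT bare-metal)"
    if machine.endswith(("a3", "a4", "m3")):
        return "Chrome Golo"
    return "unknown"


def ssh_command(machine: str) -> Optional[str]:
    """Return the login command pattern from the tips, or None if none is listed."""
    location = locate(machine)
    if location == "GCE":
        # Linux buildbots only; pass --zone if gcloud asks for it.
        return f"gcloud compute ssh default@{machine}"
    if location == "Chrome infra GCE":
        return ("gcutil --project=google.com:chromecompute ssh "
                f"--ssh_user=default {machine}")
    if location.startswith("Chrome Golo"):
        # a3/a4 machines use the .chrome suffix; m3/m5 use .golo.
        suffix = ".chrome" if machine.endswith(("a3", "a4")) else ".golo"
        return f"ssh {machine}{suffix}"
    # Chapel Hill lab and unknown machines have no ssh pattern in this document.
    return None


if __name__ == "__main__":
    for name in ("skia-vm-001", "build3-a3", "build1-m5", "slave11-c3"):
        print(name, "->", locate(name), "|", ssh_command(name))
```

For Chrome Golo and Chrome infra GCE machines, filing a bug with the Chrome infra team remains the preferred route; the commands are only for troopers who already have the access described above.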