# LoadLine Benchmark

This folder contains configs for the LoadLine benchmark. The goal of the
benchmark is to facilitate web performance optimization based on a realistic
workload. The benchmark has two workload variants:

* General-purpose workload representative of web usage on mobile phones
  ("phone");

* Android tablet web performance workload ("tablet").

## tl;dr: Running the Benchmark

Run the "phone" workload:

```
./cb.py loadline-phone --browser <browser> --cool-down-threshold moderate
```

Run the "tablet" workload:

```
./cb.py loadline-tablet --browser <browser> --cool-down-threshold moderate
```

The browser can be `android:chrome-canary`, `android:chrome-stable`, etc. See
the crossbench docs for the full list of options.

The cool-down threshold is recommended because by default the benchmark runs
100 repetitions, which creates a significant load on the device and can lead
to overheating. This option inserts cool-down periods to ensure that the
device stays below the given thermal level. Possible values include `light`,
`moderate`, and `severe`.

Results will be located in `results/latest/`. Notable files in this directory:

* `loadline_probe.csv`: Final score for the run
* `trace_processor/loadline_benchmark_score.csv`: Breakdown of scores per
  page and repetition

## Benchmark Details

### Background

The web is one of the most important use cases on mobile devices. Page loading
speed is a crucial part of the user experience, and it is not well covered by
existing benchmarks (Speedometer, JetStream, MotionMark). Experiments show
that raw CPU performance does not always translate into faster page loads,
since loading is a complex, highly parallelized process that stresses many
browser and OS components and their interactions. Hence the need for a
dedicated web loading benchmark that enables us to compare devices and track
improvements across OS and browser releases.

### Workload

We aimed for two configurations:

* **Representative mobile web on Android usage (~5 pages)**

  Aimed at covering loading scenarios representative of real web workloads
  and user environments on Android mobile phones.

* **Android tablet web performance (~5 pages)**

  A set of larger desktop-class workloads intended for tablet/large-screen
  devices running Android.

The biggest challenges we faced in achieving this goal were:

* **Representativeness**: How do we determine a representative set of
  websites, given the enormous corpus of websites whose overall distribution
  is not thoroughly understood?
* **Metrics**: Existing page load metrics generalize well over O(millions) of
  page loads across a variety of sites, but are a poor fit for judging the
  performance of a specific site.
* **Noise**: The web evolves. To ensure the benchmark workloads stay
  consistent over time, we chose to use recorded & replayed workloads.
  However, page load is a very complex and nondeterministic process, so naive
  replays are often not consistent.

### Site Selection

We did a thorough analysis to ensure we select relevant and representative
sites. Our aspiration was to understand the distribution of the most important
CUJs and performance characteristics on the web, and to use this knowledge to
select a small number of representative CUJs such that their performance
characteristics maximize coverage of the distribution.

Practically, we evaluated ~50 prominent sites across a number of different
characteristics (dimensions) via trace-based analysis, cross-checking against
field data. We clustered similar pages and selected representatives for
important clusters. In the end, this was a manual selection aided by
algorithmic clustering/correlation analysis.

We evaluated over 20 dimensions for suitability, relevance to our site
selection, and low correlation with each other. Of these, we chose 6 primary
metrics on which we optimized coverage: website type, workload size (CPU
time), DOM/layout complexity (#nodes), JavaScript heap size, time spent in V8,
and time spent in V8 callbacks into Blink. Secondarily, we included the
utilization of web features and relevant Mojo interfaces, e.g. video, cookies,
main/subframe communication, input events, frame production, network requests,
etc.

In the end, we selected 5 sites for each configuration, a set which we plan to
extend in the future.

#### Mobile

| Page (mobile version) | CUJ | Performance characteristics |
| --------------------- | --- | --------------------------- |
| amazon.co.uk <br> (product page) | Shopping | * average page load, large workload, large DOM/JS (but heavier on DOM) <br> * high on OOPIFs, input, http(s) resources, frame production |
| cnn.com <br> (article) | News | * slow page load, large workload, large DOM/JS (but heavier on JS) <br> * high on iframes, main frame, local storage, cookies, http(s) resources |
| wikipedia.org <br> (article) | Reference work | * fast page load, small workload, large DOM, small JS <br> * high on input <br> * low on iframes, http(s) resources, frame production |
| globo.com <br> (homepage) | News / web portal | * slow page load, large workload, small DOM, large JS <br> * high on iframes, OOPIFs, http(s) resources, frame production, cookies |
| google.com <br> (results) | Search | * fast page load, average workload, average DOM + JS <br> * high on main frame, local storage, video |

#### Tablet

| Page (desktop version) | CUJ | Performance characteristics |
| ---------------------- | --- | --------------------------- |
| amazon.co.uk <br> (product page) | Shopping | * average page load, large workload, large DOM, average JS <br> * high on OOPIFs, http(s) resources, frame production |
| cnn.com <br> (article) | News | * slow page load, large workload, large DOM/JS (but heavier on JS) <br> * high on iframes, local storage, video, frame production, cookies |
| docs.google.com <br> (document) | Productivity | * slow page load, large workload, large DOM + JS (heavier on JS) <br> * high on main frame <br> * high on font resources |
| google.com <br> (results) | Search | * fast page load, low workload, low DOM + JS <br> * high on main frame, local storage <br> * low on video |
| youtube.com <br> (video) | Media | * slow page load, very high workload, large DOM, small JS heap, average JS time <br> * high on video |

### Metrics

Measuring page load accurately in generic ways is difficult (e.g. some pages
require significant work after LCP to become "interactive"), and inaccurate
metrics risk driving incorrect power/performance trade-offs. Once we had a
selection of sites, we looked at each of them and devised site-specific
metrics that better reflect when a page is ready to be interacted with.
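
Because each page has its own readiness metric, the per-page breakdown is
often more informative than the single final score. As a quick sketch
(assuming the breakdown file is a regular comma-separated CSV with a header
row, and that standard Unix tools are available), you can eyeball it straight
from the results directory:

```
# Render the per-page, per-repetition score breakdown as an aligned table
column -s, -t results/latest/trace_processor/loadline_benchmark_score.csv | head -n 20
```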

## Reproducibility / Noise

Page load is a very complex process and is inherently noisy. There is a lot of
concurrent work happening in the browser, and slight timing differences can
have a big impact on the actual workload being executed, and thus on the
perceived load speed.

We took various measures to reduce this variability, but there is still room
for improvement, which we plan to address in future versions of this
benchmark.

### Score

We are still actively developing this benchmark, and we will try our best to
keep the score as stable as possible across changes. We will update the
benchmark's minor version if we introduce changes that have a chance of
affecting the score. This version is reported in the benchmark output and
should be quoted when sharing scores.

### Cross-device Comparisons

The workload on two different devices will differ due to variance in
application tasks, such as the number of frames rendered during load, timers
firing more frequently during load, etc.

It is important to stress that page load is a complex workload. As a result,
if we were to compare scores between two devices A and B, with device A having
2x the CPU speed of device B, then A's score will be less than 2x B's score.
This is not an error or an artifact of the measurement; it is a result of the
adaptive nature of web loading (and/or of a browser trying to get the best
user experience from the resources available). The benchmark score reflects
the actual user-observable loading speed.

### Web Page Replay

To maintain reproducibility, the benchmark uses the
[web page replay](https://chromium.googlesource.com/catapult/+/HEAD/web_page_replay_go/README.md)
mechanism. Archives of the web pages are stored in the
`chrome-partner-telemetry` cloud bucket, so you'll need access to that bucket
to run the benchmark on recorded pages (you can still run the benchmark on
live sites if you don't have access, but there's no guarantee that the results
will be reproducible/comparable).

### Repetitions {#repetitions}

By default, the benchmark runs **100** repetitions, as we have found that this
brings the noise down to an acceptable level. You can override this setting
via `--repetitions`.

### Thermal Throttling

Given the high number of repetitions in the standard configuration, thermal
throttling can be an issue, especially on more thermally constrained devices.
A one-size-fits-all solution to this problem is quite hard; even detecting
throttling is very device-specific. So for the first version of the benchmark,
we leave it up to the user to determine whether the results might be
influenced by thermal issues. Crossbench has a way of adding a delay between
repetitions that can be used to mitigate this problem (at the expense of
longer running times): `--cool-down-time`.

In the future, we want to look at ways to aid users in detecting/mitigating
thermal throttling (e.g. notifying users that thermal throttling happened
during the test, or automatically waiting between repetitions until the device
is in a good thermal state).

## Configuration

In its standard configuration, the benchmark will run 100 iterations. In
addition, the WPR server will run on the device, rather than on the host, to
reduce the noise caused by the latency of the host-to-device connection.

Both of these settings can be overridden if needed (see
[Repetitions](#repetitions) and [WPR on host](#host_wpr)).
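
For example, a shorter (but noisier) run with fewer repetitions might look
like this (a sketch; `android:chrome-stable` stands in for your browser of
choice):

```
# Quick sanity check: 10 repetitions instead of the default 100
./cb.py loadline-phone --browser android:chrome-stable --repetitions 10
```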

### Run the benchmark on live sites

```
./cb.py loadline-phone --browser <browser> --network live
```

*Attention:* This benchmark uses various custom metrics tailored to the
individual pages. If the pages change, it is not guaranteed that these metrics
will keep working.

### Record a new WPR archive

Uncomment the `wpr: {},` line in the probe config and run the benchmark on
live sites (see the command above). The archive will be located in
`results/latest/archive.wprgo`.

*Attention:* The same caveat applies here: the custom metrics are tailored to
the individual pages and may stop working if the pages change.

### Running WPR on the host {#host_wpr}

If you want as little overhead as possible on the device, e.g. for power
measurements, you might consider running the WPR server on the host machine
instead of the device under test. You can do this by adding
`run_on_device: false,` to the corresponding network config file,
`config/benchmark/loadline/network_config_phone.hjson` or
`config/benchmark/loadline/network_config_tablet.hjson`.

Note that Go must be available on the host machine. Check
[go.mod](https://chromium.googlesource.com/catapult/+/HEAD/web_page_replay_go/go.mod)
for the minimum version.

### Run the benchmark with the full set of experimental metrics

Sometimes, to investigate the source of a regression or to get deeper
insights, it may be useful to collect more detailed traces and compute
additional metrics. This can be done with the following command:

```
./cb.py loadline-phone --browser <browser> \
    --probe-config config/benchmark/loadline/probe_config_experimental.hjson
```

Note that collecting detailed traces incurs significant overhead, so the
benchmark scores will likely be lower than in the default configuration.

## Common Issues

### Problems finding wpr.go

If you see a `Could not find wpr.go binary` error:

* If you have chromium checked out locally: set the `CHROMIUM_SRC`
  environment variable to the path of your chromium/src folder.
* If not (or if you're still getting this error): see the next section.

### Running the benchmark without a full chromium checkout

Follow the
[crossbench development instructions](https://chromium.googlesource.com/crossbench/#development)
to check out the code and run crossbench standalone.

### Problems accessing the cloud bucket

In some cases, you might need to download the web page archive manually. If
so, save the archive file corresponding to the version you are running
(`gs://chrome-partner-telemetry/loading_benchmark/archive_*.wprgo`) locally
and run the benchmark as follows:

```
./cb.py loadline-phone --network <path to archive.wprgo>
```
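
If you have access to the bucket, one way to fetch the archives manually is
with `gsutil` (a sketch; assumes the Google Cloud SDK is installed and
authenticated for this bucket):

```
# Download the recorded page archives; keep the one matching the
# benchmark version you are running
gsutil cp 'gs://chrome-partner-telemetry/loading_benchmark/archive_*.wprgo' .
```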