# LoadLine Benchmark

This folder contains configs for the LoadLine benchmark. The goal of the
benchmark is to facilitate web performance optimization based on a realistic
workload. The benchmark has two workload variants:

* General-purpose workload representative of web usage on mobile phones
  ("phone");

* Android Tablet web performance workload ("tablet").

## tl;dr: Running the Benchmark

Run the "phone" workload:

```
./cb.py loadline-phone --browser <browser> --cool-down-threshold moderate
```

Run the "tablet" workload:

```
./cb.py loadline-tablet --browser <browser> --cool-down-threshold moderate
```

The browser can be `android:chrome-canary`, `android:chrome-stable`, etc. See
the crossbench docs for the full list of options.

A cool-down threshold is recommended because, by default, the benchmark runs
100 repetitions, which creates a significant load on the device and can lead to
overheating. This option inserts cool-down periods to ensure that the device
stays below the given thermal level. Possible values include `light`,
`moderate`, and `severe`.
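
For example, a complete "phone" run on Chrome Stable (any of the browser names
above works here):

```
./cb.py loadline-phone --browser android:chrome-stable --cool-down-threshold moderate
```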

Results will be located in `results/latest/`. Notable files in this directory:

* `loadline_probe.csv`: Final score for the run
* `trace_processor/loadline_benchmark_score.csv`: Breakdown of scores per
  page and repetition
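
For a quick look at these files from the command line (a sketch; the exact CSV
columns may change between benchmark versions):

```
# Final score for the run
cat results/latest/loadline_probe.csv

# Per-page, per-repetition breakdown, formatted as a table
column -t -s ',' results/latest/trace_processor/loadline_benchmark_score.csv
```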

## Benchmark Details

### Background

The web is one of the most important use cases on mobile devices. Page loading
speed is a crucial part of the user experience, and it is not well covered by
existing benchmarks (Speedometer, JetStream, MotionMark). Experiments show that
raw CPU performance does not always result in faster web loading, since loading
is a complex, highly parallelized process that stresses many browser and OS
components and their interactions. Hence the need for a dedicated web loading
benchmark that enables us to compare devices and track improvements across OS
and browser releases.

### Workload

We aimed for two configurations:

* **Representative mobile web usage on Android (~5 pages)**

  Aimed at covering loading scenarios representative of real web workloads and
  user environments on Android mobile phones.

* **Android Tablet web performance (~5 pages)**

  A set of larger desktop-class workloads intended for tablet/large-screen
  devices running Android.

The biggest challenges we faced in achieving this goal were:

* **Representativeness**: How do we determine a representative set of
  websites, given the enormous corpus of sites whose overall distribution is
  not thoroughly understood?
* **Metrics**: Existing page load metrics generalize well for O(millions) of
  page loads across a variety of sites, but are a poor fit for judging the
  performance of a specific site.
* **Noise**: The web evolves. To ensure the benchmark workloads stay
  consistent over time, we chose to use recorded & replayed workloads.
  However, page load is a very complex and nondeterministic process, so naive
  replays are often not consistent.

### Site Selection

We did a thorough analysis to ensure we selected relevant and representative
sites. Our aspiration was to understand the distribution of the most important
CUJs and performance characteristics on the web, and to use this knowledge to
select a small number of representative CUJs such that their performance
characteristics maximize coverage of the distribution.

Practically, we evaluated ~50 prominent sites across a number of different
characteristics (dimensions) via trace-based analysis, cross-checking against
field data. We clustered similar pages and selected representatives for the
important clusters. In the end, this was a manual selection aided by
algorithmic clustering/correlation analysis.

We looked at over 20 dimensions, screening them for suitability and relevance
to our site selection, and for low correlation between dimensions. Of these, we
chose 6 primary metrics to optimize coverage on: website type, workload size
(CPU time), DOM/layout complexity (number of nodes), JavaScript heap size, time
spent in V8, and time spent in V8 callbacks into Blink. Secondarily, we
included utilization of web features and relevant Mojo interfaces, e.g. video,
cookies, main/subframe communication, input events, frame production, network
requests, etc.

In the end, we selected 5 sites for each configuration, a set we plan to extend
in the future.

#### Mobile

| Page (mobile version) | CUJ | Performance characteristics |
| -------------------------- | ------------------ | -------------------------- |
| amazon.co.uk <br> (product page) | Shopping | * average page load, large workload, large DOM/JS (but heavier on DOM) <br> * high on OOPIFs, input, http(s) resources, frame production |
| cnn.com <br> (article) | News | * slow page load, large workload, large DOM/JS (but heavier on JS) <br> * high on iframes, main frame, local storage, cookies, http(s) resources |
| wikipedia.org <br> (article) | Reference work | * fast page load, small workload, large DOM, small JS <br> * high on input <br> * low on iframes, http(s) resources, frame production |
| globo.com <br> (homepage) | News / web portal | * slow page load, large workload, small DOM, large JS <br> * high on iframes, OOPIFs, http(s) resources, frame production, cookies |
| google.com <br> (results) | Search | * fast page load, average workload, average DOM + JS <br> * high on main frame, local storage, video |

#### Tablet

| Page (desktop version) | CUJ | Performance characteristics |
| -------------------------- | ------------ | -------------------------------- |
| amazon.co.uk <br> (product page) | Shopping | * average page load, large workload, large DOM, average JS <br> * high on OOPIFs, http(s) resources, frame production |
| cnn.com <br> (article) | News | * slow page load, large workload, large DOM/JS (but heavier on JS) <br> * high on iframes, local storage, video, frame production, cookies |
| docs.google.com <br> (document) | Productivity | * slow page load, large workload, large DOM + JS (heavier on JS) <br> * high on main frame <br> * high on font resources |
| google.com <br> (results) | Search | * fast page load, low workload, low DOM + JS <br> * high on main frame, local storage <br> * low on video |
| youtube.com <br> (video) | Media | * slow page load, very high workload, large DOM, small JS heap, average JS time <br> * high on video |

### Metrics

Measuring page load accurately in generic ways is difficult (e.g. some pages
require significant work after LCP to become "interactive"), and inaccurate
metrics risk incorrect power/perf trade-off choices. Once we had a selection of
sites, we looked at each one of them and devised site-specific metrics that
better reflect when a page is ready to be interacted with.

## Reproducibility / Noise

Page load is a very complex process and is inherently noisy. There is a lot of
concurrent work happening in the browser, and slight timing differences can
have a big impact on the actual workload being executed, and thus on the
perceived load speed.

We took various measures to reduce this variability, but there is still room to
improve, and we plan to do so in the next versions of this benchmark.

### Score

We are still actively developing this benchmark, and we will try our best to
keep the score as stable across changes as possible. We will update the
benchmark's minor version if we introduce changes that have a chance of
affecting the score. This version is reported in the benchmark output and
should be quoted when sharing scores.

### Cross-device Comparisons

The workload on two different devices will differ due to variance in
application tasks, such as the number of frames rendered during load, timers
being executed more frequently during load, etc.

It is important to stress that page load is a complex workload. As a result, if
we were to compare scores between two devices A and B, where device A has 2x
the CPU speed of device B, A's score will be less than 2x B's score. This is
not an error or an artifact of the measurement; it is a result of the adaptive
nature of web loading (and/or the potential effort of a browser trying to get
the best user experience from the resources available). The benchmark score
reflects the actual user-observable loading speed.

### Web Page Replay

To maintain reproducibility, the benchmark uses the
[web page replay](https://chromium.googlesource.com/catapult/+/HEAD/web_page_replay_go/README.md)
mechanism. Archives of the web pages are stored in the
`chrome-partner-telemetry` cloud bucket, so you'll need access to that bucket to
run the benchmark on recorded pages (you can still run the benchmark on live
sites if you don't have access, but there's no guarantee that the results will
be reproducible/comparable).

### Repetitions {#repetitions}

By default, the benchmark runs **100** repetitions, as we have found that this
brings the noise down to an acceptable level. You can override this setting via
`--repetitions`.
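
For example, a quick (but noisier) run with only 10 repetitions:

```
./cb.py loadline-phone --browser android:chrome-stable --repetitions 10
```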

### Thermal Throttling

Given the high number of repetitions in the standard configuration, thermal
throttling can be an issue, especially on more thermally constrained devices. A
one-size-fits-all solution to this problem is quite hard; even detecting
throttling is very device-specific. So, for the first version of the benchmark,
we leave it up to the user to determine whether the results might be influenced
by thermal issues. Crossbench has a way of adding a delay between repetitions
that can be used to mitigate this problem (at the expense of longer running
times): `--cool-down-time`.
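
For example (assuming the flag takes a duration in seconds; check
`./cb.py loadline-phone --help` for the exact syntax):

```
./cb.py loadline-phone --browser android:chrome-stable --cool-down-time 30
```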

In the future, we want to look at ways to help users detect and mitigate
thermal throttling (e.g. notifying users that thermal throttling happened
during the test, or automatically waiting between repetitions until the device
is in a good thermal state).

## Configuration

In its standard configuration, the benchmark runs 100 iterations. In addition,
the WPR server runs on the device, rather than on the host, to reduce the noise
caused by the latency introduced by the host-to-device connection.

Both of these settings can be overridden if needed or desirable
([Repetitions](#repetitions), [WPR on the host](#host_wpr)).

### Run the benchmark on live sites

```
./cb.py loadline-phone --browser <browser> --network live
```

*Attention:* This benchmark uses various custom metrics tailored to the
individual pages. If the pages change, it is not guaranteed that these metrics
will keep working.

### Record a new WPR archive

Uncomment the `wpr: {},` line in the probe config and run the benchmark on live
sites (see the command above). The archive will be located in
`results/latest/archive.wprgo`.
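
A sketch of the full recording flow (the browser name is just an example):

```
# 1. Uncomment the `wpr: {},` line in the probe config, then record:
./cb.py loadline-phone --browser android:chrome-stable --network live

# 2. Keep a copy of the freshly recorded archive:
cp results/latest/archive.wprgo my_archive.wprgo
```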

*Attention:* This benchmark uses various custom metrics tailored to the
individual pages. If the pages change, it is not guaranteed that these metrics
will keep working.

### Running WPR on the host {#host_wpr}

If you want to keep the overhead on the device as low as possible, e.g. for
power measurements, you might consider running the WPR server on the host
machine instead of the device under test. You can do this by adding
`run_on_device: false,` to the corresponding network config file,
`config/benchmark/loadline/network_config_phone.hjson` or
`config/benchmark/loadline/network_config_tablet.hjson`.
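
A minimal sketch of that change (the other keys in the actual config file are
left out here):

```
{
  // Run the WPR server on the host instead of the device under test.
  run_on_device: false,
  // ...keep the rest of the existing network config unchanged.
}
```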

Note that Go must be available on the host machine. Check
[go.mod](https://chromium.googlesource.com/catapult/+/HEAD/web_page_replay_go/go.mod)
for the minimum version.
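
You can check the locally installed version (assuming Go is on your `PATH`)
with:

```
go version
```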

### Run the benchmark with the full set of experimental metrics

Sometimes, to investigate the source of a regression or to get deeper insights,
it may be useful to collect more detailed traces and compute additional
metrics. This can be done with the following command:

```
./cb.py loadline-phone --browser <browser> \
  --probe-config config/benchmark/loadline/probe_config_experimental.hjson
```

Note that collecting detailed traces incurs significant overhead, so the
benchmark scores will likely be lower than in the default configuration.

## Common issues

### Problems finding wpr.go

If you see a `Could not find wpr.go binary` error:

* If you have chromium checked out locally: set the `CHROMIUM_SRC` environment
  variable to the path of your chromium/src folder (see the example below).
* If not (or if you're still getting this error): see the next section.
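
For example, in a bash-like shell (the path is an example; use your own
checkout location):

```
export CHROMIUM_SRC=$HOME/chromium/src
```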

### Running the benchmark without a full chromium checkout

Follow the
[crossbench development instructions](https://chromium.googlesource.com/crossbench/#development)
to check out the code and run crossbench standalone.

### Problems accessing the cloud bucket

In some cases, you might need to download the web page archive manually. In
that case, save the archive file corresponding to the version you are running
(`gs://chrome-partner-telemetry/loading_benchmark/archive_*.wprgo`) locally and
run the benchmark as follows:

```
./cb.py loadline-phone --network <path to archive.wprgo>
```
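
For example, using `gsutil` from the Google Cloud SDK (assuming you have read
access to the bucket; the local file name depends on the version you download):

```
gsutil cp 'gs://chrome-partner-telemetry/loading_benchmark/archive_*.wprgo' .
./cb.py loadline-phone --network archive_<version>.wprgo
```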
281