# Trace-based metrics

_The metrics subsystem is a part of the
[trace processor](/docs/analysis/trace-processor.md) which uses traces to
compute reproducible metrics. It can be used in a wide range of situations;
examples include benchmarks, lab tests and analysis of large corpora of
traces._

![Block diagram of metrics](/docs/images/metrics-summary.png)

## Quickstart

The [quickstart](/docs/quickstart/trace-analysis.md) provides a quick overview
on how to compute trace-based metrics using trace processor.
## Introduction

### Motivation

Performance metrics are useful for monitoring the health of a system and
ensuring that it does not regress over time as new features are added.

However, metrics retrieved directly from the system have a downside: if there is
a regression, it is difficult to root-cause the issue. Often, the problem may
not be reproducible or may rely on a particular setup.

Trace-based metrics are one possible solution to this problem. Instead of
collecting metrics directly on the system, a trace is collected and metrics are
computed from the trace. If a regression in the metric is spotted, the developer
can look directly at the trace to understand why the regression occurred
instead of having to reproduce the issue.

### Metric subsystem

The metric subsystem is a part of the
[trace processor](/docs/analysis/trace-processor.md) which executes SQL queries
against traces and produces a metric which summarizes some performance attribute
(e.g. CPU, memory, startup latency etc.).

For example, generating the Android CPU metrics on a trace is as simple as:

```
> ./trace_processor --run-metrics android_cpu <trace>
android_cpu {
  process_info {
    name: "/system/bin/init"
    threads {
      name: "init"
      core {
        id: 1
        metrics {
          mcycles: 1
          runtime_ns: 570365
          min_freq_khz: 1900800
          max_freq_khz: 1900800
          avg_freq_khz: 1902017
        }
      }
      ...
    }
    ...
  }
  ...
}
```

## Metric development guide

As metric writing requires a lot of iteration to get right, there are several
tips which make the experience a lot smoother.

### Hot reloading metrics
To obtain the fastest possible iteration time when developing metrics,
it's possible to hot reload any changes to SQL; this will skip over both
recompilation (for builtin metrics) and trace load (for both builtin and
custom metrics).

To do this, trace processor is started in *interactive mode* while
still specifying command line flags about which metrics should be run and
the paths of any extensions. Then, in the REPL shell, the commands
`.load-metrics-sql` (which causes any SQL on disk to be re-read) and
`.run-metrics` (which runs the metrics and prints the result) can be used.

For example, suppose we want to iterate on the `android_startup` metric. We
can run the following commands from a Perfetto checkout:
```
> ./tools/trace_processor --interactive \
  --run-metrics android_startup \
  --metric-extension src/trace_processor/metrics@/ \
  --dev \
  <trace>
android_startup {
  <contents of startup metric>
}

# Now make any changes you want to the SQL files related to the startup
# metric. Even adding new files in src/trace_processor/metrics works.

# Then, we can reload the changes using `.load-metrics-sql`.
> .load-metrics-sql

# We can rerun the changed metric using `.run-metrics`.
> .run-metrics
android_startup {
  <contents of changed startup metric>
}
```

NOTE: see below for why `--dev` is required for this command.

This also works for custom metrics specified on the command line:
```
> ./tools/trace_processor -i --run-metrics /tmp/my_custom_metric.sql <trace>
my_custom_metric {
  <contents of my_custom_metric>
}

# Change the SQL file as before.

> .load-metrics-sql
> .run-metrics
my_custom_metric {
  <contents of changed my_custom_metric>
}
```

WARNING: it is currently not possible to reload protos in the same way. If
protos are changed, a recompile (for built-in metrics) and a reinvocation of
trace processor are necessary to pick up the changes.

WARNING: files deleted from `--metric-extension` folders are *not* removed
and will remain available e.g. to RUN_METRIC invocations.

### Modifying built-in metric SQL without recompiling
It is possible to override the SQL of built-in metrics at runtime without
needing to recompile trace processor. To do this, the flag `--metric-extension`
needs to be specified with the disk path where the built-in metrics live and the
special string `/` for the virtual path.

For example, from inside a Perfetto checkout:
```
> ./tools/trace_processor \
  --run-metrics android_cpu \
  --metric-extension src/trace_processor/metrics@/ \
  --dev \
  <trace>
```
This will run the CPU metric using the live SQL in the repo, *not* the SQL
definition built into the binary.

NOTE: protos are *not* overridden in the same way - if any proto messages are
changed, a recompile of trace processor is required for the changes to be
available.

NOTE: the `--dev` flag is required for the use of this feature. This flag
ensures that the feature is not accidentally used in production, as it is only
intended for local development.

## Metric helper functions

There are several useful helper functions available when writing a metric.

### CREATE_FUNCTION
`CREATE_FUNCTION` allows you to define a parameterized SQL statement which
is executable as a function. The inspiration for this function is the
`CREATE FUNCTION` syntax available in other SQL engines (e.g.
[Postgres](https://www.postgresql.org/docs/current/sql-createfunction.html)).

NOTE: CREATE_FUNCTION only supports returning *exactly* a single value (i.e. a
single row with a single column). For returning a single row with multiple
columns or multiple rows, see `CREATE_VIEW_FUNCTION` instead.

Usage of `CREATE_FUNCTION` is as follows:
```sql
-- First, we define the function we'll use in the following statement.
SELECT CREATE_FUNCTION(
  -- First argument: prototype of the function; this is very similar to
  -- function definitions in other languages - you set the function name
  -- (IS_TS_IN_RANGE in this example) and the arguments
  -- (ts, begin_ts and end_ts) along with their types (LONG for all
  -- arguments here).
  'IS_TS_IN_RANGE(ts LONG, begin_ts LONG, end_ts LONG)',
  -- Second argument: the return type of the function. Only single values
  -- can be returned in CREATE_FUNCTION. See CREATE_VIEW_FUNCTION for defining
  -- a function returning multiple rows/columns.
  'BOOL',
  -- Third argument: the SQL body of the function. This should always be a
  -- SELECT statement (even if you're not selecting from a table as in this
  -- example). Arguments can be accessed by prefixing argument names
  -- with $ (e.g. $ts, $begin_ts, $end_ts).
  'SELECT $ts >= $begin_ts AND $ts <= $end_ts'
);

-- Now we can actually use the function in queries as if it were any other
-- function.

-- For example, it can appear in the SELECT list to produce a column:
SELECT ts, IS_TS_IN_RANGE(slice.ts, 100000, 200000) AS in_range
FROM slice;

-- It can also appear in a WHERE clause:
SELECT ts
FROM counter
WHERE IS_TS_IN_RANGE(counter.ts, 100000, 200000);

-- It can even appear in a JOIN ON clause:
SELECT slice.ts
FROM launches
JOIN slice ON IS_TS_IN_RANGE(slice.ts, launches.ts, launches.end_ts);
```

### CREATE_VIEW_FUNCTION
Similar to `CREATE_FUNCTION`, `CREATE_VIEW_FUNCTION` can be used to
define a parameterized SQL statement which returns multiple rows and
columns. Such functions can be invoked anywhere a table or
view can be used.

Usage of `CREATE_VIEW_FUNCTION` is as follows:
```sql
-- First, we define the view function we'll use in the following
-- statement.
SELECT CREATE_VIEW_FUNCTION(
  -- First argument: prototype of the function; this is very similar to
  -- function definitions in other languages - you set the function name
  -- (NAMED_SLICE_IN_RANGE in this example) and the arguments
  -- (name_glob, begin_ts and end_ts) along with their types (STRING and
  -- LONG for the arguments here).
  'NAMED_SLICE_IN_RANGE(name_glob STRING, begin_ts LONG, end_ts LONG)',
  -- Second argument: the columns returned by the function; both the
  -- name and type of each column is specified.
  'id INT, ts LONG, ts_end LONG, name STRING, track_id STRING',
  -- Third argument: the SQL body of the function. This should always be a
  -- SELECT statement. Arguments can be accessed by prefixing argument names
  -- with $ (e.g. $name_glob, $begin_ts, $end_ts).
  'SELECT id, ts, ts + dur as ts_end, name, track_id
   FROM slice
   WHERE name GLOB $name_glob AND ts >= $begin_ts AND ts <= $end_ts'
);

-- Now we can use the function in queries in the same way we'd query a
-- table or view.
SELECT id, ts, ts_end, name
FROM NAMED_SLICE_IN_RANGE('launching:*', 12345, 67890);

-- This is exactly equivalent to the query:
SELECT id, ts, ts + dur as ts_end, name
FROM slice
WHERE name GLOB 'launching:*' AND ts >= 12345 AND ts <= 67890;

-- View functions can also be used in joins.
SELECT sl.id, sl.ts, sl.ts_end, sl.name
FROM trace_bounds AS b
JOIN NAMED_SLICE_IN_RANGE('launching:*', b.start_ts, b.end_ts) AS sl;
```

### RUN_METRIC
`RUN_METRIC` allows you to run another metric file. This allows you to use views
or tables defined in that file without repetition.

Conceptually, `RUN_METRIC` adds *composability* to SQL queries, allowing a big
SQL metric to be broken into smaller, reusable files. This is similar to how
functions allow decomposing large programs in traditional programming
languages.

A simple usage of `RUN_METRIC` would be as follows:

In file android/foo.sql:
```sql
CREATE VIEW view_defined_in_foo AS
SELECT *
FROM slice
LIMIT 1;
```

In file android/bar.sql:
```sql
SELECT RUN_METRIC('android/foo.sql');

CREATE VIEW view_depending_on_view_from_foo AS
SELECT *
FROM view_defined_in_foo
LIMIT 1;
```

`RUN_METRIC` also supports running *templated* metric files. Here's an example
of what that looks like:

In file android/slice_template.sql:
```sql
CREATE VIEW {{view_name}} AS
SELECT *
FROM slice
WHERE slice.name = '{{slice_name}}';
```

In file android/metric.sql:
```sql
SELECT RUN_METRIC(
  'android/slice_template.sql',
  'view_name', 'choreographer_slices',
  'slice_name', 'Choreographer#doFrame'
);

CREATE VIEW long_choreographer_slices AS
SELECT *
FROM choreographer_slices
WHERE dur > 1e6;
```

When running `slice_template.sql`, trace processor will substitute the arguments
passed to `RUN_METRIC` into the templated file *before* executing the file using
SQLite.

In other words, this is what SQLite sees and executes in practice for the above
example:
```sql
CREATE VIEW choreographer_slices AS
SELECT *
FROM slice
WHERE slice.name = 'Choreographer#doFrame';

CREATE VIEW long_choreographer_slices AS
SELECT *
FROM choreographer_slices
WHERE dur > 1e6;
```

The syntax for templated metric files is essentially a highly simplified version
of [Jinja's](https://jinja.palletsprojects.com/en/3.0.x/) syntax.

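The substitution step can be pictured as plain string replacement. Below is a
minimal Python sketch of such `{{name}}` templating, purely for illustration -
it is not trace processor's actual implementation:

```python
import re

def expand_template(sql: str, args: dict) -> str:
    """Replace every {{name}} placeholder in sql with its value from args."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: args[m.group(1)], sql)

template = (
    "CREATE VIEW {{view_name}} AS\n"
    "SELECT * FROM slice WHERE slice.name = '{{slice_name}}';"
)
expanded = expand_template(template, {
    "view_name": "choreographer_slices",
    "slice_name": "Choreographer#doFrame",
})
print(expanded)
```

Because the expansion happens before SQLite ever sees the file, the template
itself does not need to be valid SQL until all placeholders are filled in.
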
## Walkthrough: prototyping a metric

TIP: To see how to add a new metric to trace processor, see the checklist
[here](/docs/contributing/common-tasks.md#new-metric).

This walkthrough will outline how to prototype a metric locally without needing
to compile trace processor. This metric will compute the CPU time for every
process in the trace and list the names of the top 5 processes (by CPU time) and
the number of threads created by the process.

NOTE: See this [GitHub gist][gist] to see how the code should look at the end of
      the walkthrough. The prerequisites and Step 4 below give instructions on
      how to get trace processor and run the metrics code.

[gist]: https://gist.github.com/LalitMaganti/c221cf0cae17e298dfa82b118edf9080

### Prerequisites

As a setup step, create a folder to act as a scratch workspace; this folder will
be referred to using the env variable `$WORKSPACE` in Step 4.

The other requirement is trace processor. This can be downloaded from
[here](https://get.perfetto.dev/trace_processor) or built from source using the
instructions [here](trace-processor.md). Whichever method is chosen, the
`$TRACE_PROCESSOR` env variable will be used to refer to the location of the
binary in Step 4.

### Step 1

As all metrics in the metrics platform are defined using protos, the metric
needs to be structured as a proto. For this metric, there needs to be some
notion of a process name along with its CPU time and number of threads.

Starting off, in a file named `top_five_processes.proto` in the workspace,
create a basic proto message called ProcessInfo with those three fields:

```protobuf
message ProcessInfo {
  optional string process_name = 1;
  optional int64 cpu_time_ms = 2;
  optional uint32 num_threads = 3;
}
```

Next, create a wrapping message which will hold the repeated field containing
the top 5 processes.

```protobuf
message TopProcesses {
  repeated ProcessInfo process_info = 1;
}
```

Finally, define an extension to the root proto for all metrics (the
[TraceMetrics](https://android.googlesource.com/platform/external/perfetto/+/HEAD/protos/perfetto/metrics/metrics.proto#39)
proto).

```protobuf
extend TraceMetrics {
  optional TopProcesses top_five_processes = 450;
}
```

Adding this extension field allows trace processor to link the newly defined
metric to the `TraceMetrics` proto.

_Notes:_

- The field ids 450-500 are reserved for local development so any of them can be
  used as the field id for the extension field.
- The choice of field name here is important as the SQL file and the final table
  generated in SQL will be based on this name.

Putting everything together, along with some boilerplate preamble, gives:

```protobuf
syntax = "proto2";

package perfetto.protos;

import "protos/perfetto/metrics/metrics.proto";

message ProcessInfo {
  optional string process_name = 1;
  optional int64 cpu_time_ms = 2;
  optional uint32 num_threads = 3;
}

message TopProcesses {
  repeated ProcessInfo process_info = 1;
}

extend TraceMetrics {
  optional TopProcesses top_five_processes = 450;
}
```

### Step 2

Next, write the SQL to generate the table of the top 5 processes ordered by the
sum of the CPU time they ran for and the number of threads which were associated
with the process.

The following SQL should be added to a file called `top_five_processes.sql` in
the workspace:

```sql
CREATE VIEW top_five_processes_by_cpu AS
SELECT
  process.name as process_name,
  CAST(SUM(sched.dur) / 1e6 as INT64) as cpu_time_ms,
  COUNT(DISTINCT utid) as num_threads
FROM sched
INNER JOIN thread USING(utid)
INNER JOIN process USING(upid)
GROUP BY process.name
ORDER BY cpu_time_ms DESC
LIMIT 5;
```

Let's break this query down:

1. The first table used is the `sched` table. This contains all the scheduling
   data available in the trace. Each scheduling "slice" is associated with a
   thread which is uniquely identified in Perfetto traces using its `utid`. The
   two pieces of information needed from the sched table are the `dur` -
   short for duration, this is the amount of time the slice lasted - and the
   `utid` which will be used to join with the thread table.
2. The next table is the thread table. This gives us a lot of information which
   is not particularly interesting (including its thread name) but it does give
   us the `upid`. Similar to `utid`, `upid` is the unique identifier for a
   process in a Perfetto trace. In this case, `upid` will refer to the process
   which hosts the thread given by `utid`.
3. The final table is the process table. This gives the name of the process
   associated with the original sched slice.
4. With the process, thread and duration for each sched slice, all the slices
   for a single process are collected and their durations summed to get the
   CPU time (dividing by 1e6 as sched's duration is in nanoseconds), along
   with the number of distinct threads.
5. Finally, we order by the CPU time and limit to the top 5 results.

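For intuition, the same aggregation can be sketched in Python over a few
hypothetical pre-joined rows; the real computation happens in SQL over the
sched, thread and process tables inside trace processor:

```python
from collections import defaultdict

# Each row: (process_name, utid, dur_ns), i.e. one sched slice already
# joined through its thread to the owning process. Made-up sample data.
rows = [
    ("com.example.app", 1, 4_000_000),
    ("com.example.app", 2, 2_500_000),
    ("system_server", 3, 3_000_000),
    ("system_server", 3, 1_000_000),
]

cpu_ns = defaultdict(int)   # sum of durations per process (GROUP BY + SUM)
threads = defaultdict(set)  # distinct utids per process (COUNT(DISTINCT utid))
for name, utid, dur in rows:
    cpu_ns[name] += dur
    threads[name].add(utid)

# Divide by 1e6 to convert nanoseconds to milliseconds, then order by
# CPU time descending and keep the top 5 (mirrors ORDER BY ... LIMIT 5).
top5 = sorted(
    ((name, int(ns / 1e6), len(threads[name])) for name, ns in cpu_ns.items()),
    key=lambda r: r[1],
    reverse=True,
)[:5]
print(top5)
```
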
### Step 3

Now that the result of the metric has been expressed as an SQL table, it needs
to be converted to a proto. The metrics platform has built-in support for
emitting protos using SQL functions, something which is used extensively in
this step.

Let's look at how it works for the table above.

```sql
CREATE VIEW top_five_processes_output AS
SELECT TopProcesses(
  'process_info', (
    SELECT RepeatedField(
      ProcessInfo(
        'process_name', process_name,
        'cpu_time_ms', cpu_time_ms,
        'num_threads', num_threads
      )
    )
    FROM top_five_processes_by_cpu
  )
);
```

Breaking this down again:

1. Starting from the inner-most SELECT statement, there is what looks like
   a function call to the ProcessInfo function; in fact this is no coincidence.
   For each proto that the metrics platform knows about, an SQL function is
   generated with the same name as the proto. This function takes key-value
   pairs with the key as the name of the proto field to fill and the value being
   the data to store in the field. The output is the proto created by writing
   the fields described in the function. (\*)

   In this case, this function is called once for each row in the
   `top_five_processes_by_cpu` table. The output will be the fully filled
   ProcessInfo proto.

   The call to the `RepeatedField` function is the most interesting part and
   also the most important. In technical terms, `RepeatedField` is an aggregate
   function. Practically, this means that it takes a full table of values and
   generates a single array which contains all the values passed to it.

   Therefore, the output of this whole SELECT statement is an array of 5
   ProcessInfo protos.

2. Next is the creation of the `TopProcesses` proto. By now, the syntax should
   already feel somewhat familiar; the proto builder function is called to fill
   in the `process_info` field with the array of protos from the inner function.

   The output of this SELECT is a single `TopProcesses` proto containing the
   ProcessInfos as a repeated field.

3. Finally, the view is created. This view is specially named to allow the
   metrics platform to query it to obtain the root proto for each metric
   (in this case `TopProcesses`). See the note below as to the pattern behind
   this view's name.

(\*) _This is not strictly true. To type-check the protos, some metadata
is returned about the type of the proto but this is unimportant for metric
authors._

NOTE: It is important that the views be named {name of TraceMetrics extension
      field}\_output. This is the pattern used and expected by the metrics
      platform for all metrics.

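As a rough analogy, the proto-builder and `RepeatedField` functions behave like
the following Python sketch. This is illustrative only: trace processor emits
binary protos, and the dicts and lists here are stand-ins for them.

```python
def ProcessInfo(**fields):
    # Analogy for a generated proto-builder function: takes field-name/value
    # pairs and produces one "proto" (modelled here as a dict).
    return dict(fields)

def RepeatedField(values):
    # Analogy for the aggregate function: collapses a whole column of
    # protos into a single array.
    return list(values)

# Stand-in for the rows of top_five_processes_by_cpu.
by_cpu = [
    ("com.example.app", 6, 2),
    ("system_server", 4, 1),
]

# Mirrors the nested TopProcesses(...RepeatedField(ProcessInfo(...))...) call.
top_processes = {
    "process_info": RepeatedField(
        ProcessInfo(process_name=n, cpu_time_ms=c, num_threads=t)
        for n, c, t in by_cpu
    )
}
print(top_processes)
```
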
The final file should look like so:

```sql
CREATE VIEW top_five_processes_by_cpu AS
SELECT
  process.name as process_name,
  CAST(SUM(sched.dur) / 1e6 as INT64) as cpu_time_ms,
  COUNT(DISTINCT utid) as num_threads
FROM sched
INNER JOIN thread USING(utid)
INNER JOIN process USING(upid)
GROUP BY process.name
ORDER BY cpu_time_ms DESC
LIMIT 5;

CREATE VIEW top_five_processes_output AS
SELECT TopProcesses(
  'process_info', (
    SELECT RepeatedField(
      ProcessInfo(
        'process_name', process_name,
        'cpu_time_ms', cpu_time_ms,
        'num_threads', num_threads
      )
    )
    FROM top_five_processes_by_cpu
  )
);
```

NOTE: The name of the SQL file should be the same as the name of the
      TraceMetrics extension field. This allows the metrics platform to
      associate the proto extension field with the SQL which needs to be run
      to generate it.

### Step 4

For this step, invoke trace processor shell to run the metrics (see the
[Quickstart](/docs/quickstart/trace-analysis.md) for downloading instructions):

```shell
$TRACE_PROCESSOR --run-metrics $WORKSPACE/top_five_processes.sql $TRACE 2> /dev/null
```

(For an example trace to test this on, see the Notes section below.)

From the name of the SQL file passed on the command line, trace processor finds
the proto, the name of the output table for the proto and the name of the
extension field for `TraceMetrics`; this is why it was important to choose the
names of these other objects carefully.

_Notes:_

- If something doesn't work as intended, check that the workspace looks the same
  as the contents of this
  [GitHub gist](https://gist.github.com/LalitMaganti/c221cf0cae17e298dfa82b118edf9080).
- A good example trace for this metric is the Android example trace used by the
  Perfetto UI found
  [here](https://storage.googleapis.com/perfetto-misc/example_android_trace_30s_1).
- stderr is redirected to suppress the noise trace processor generates while
  parsing the trace.

If everything went successfully, the following output should be visible
(specifically, this is the output for the Android example trace linked above):

```
[perfetto.protos.top_five_processes] {
  process_info {
    process_name: "com.google.android.GoogleCamera"
    cpu_time_ms: 15154
    num_threads: 125
  }
  process_info {
    process_name: "sugov:4"
    cpu_time_ms: 6846
    num_threads: 1
  }
  process_info {
    process_name: "system_server"
    cpu_time_ms: 6809
    num_threads: 66
  }
  process_info {
    process_name: "cds_ol_rx_threa"
    cpu_time_ms: 6684
    num_threads: 1
  }
  process_info {
    process_name: "com.android.chrome"
    cpu_time_ms: 5125
    num_threads: 49
  }
}
```

## Next steps

* The [common tasks](/docs/contributing/common-tasks.md) page gives a list of
  steps on how new metrics can be added to the trace processor.

## Appendix: Case for upstreaming

NOTE: Googlers: for internal usage of metrics in Google3 (i.e. metrics which are
confidential), please see [this internal page](https://goto.google.com/viecd).

Authors are strongly encouraged to add all metrics derived from Perfetto traces
to the Perfetto repo unless there is a clear use case (e.g. confidentiality) why
these metrics should not be publicly available.

In return for upstreaming metrics, authors will have first-class support for
running metrics locally and the confidence that their metrics will remain stable
as trace processor is developed.

As well as scaling up during development from running on a single trace locally
to running on a large set of traces, the reverse is also very useful. When an
anomaly is observed in the metrics of a lab benchmark, a representative trace
can be downloaded and the same metric run locally in trace processor.

Since the same code runs locally and remotely, developers can be confident in
reproducing the issue and can use trace processor and/or the Perfetto UI to
identify the problem.