# Trace-based metrics

_The metrics subsystem is a part of the
[trace processor](/docs/analysis/trace-processor.md) which uses traces to
compute reproducible metrics. It can be used in a wide range of situations;
examples include benchmarks, lab tests and large corpora of traces._

![Block diagram of metrics](/docs/images/metrics-summary.png)

## Quickstart

The [quickstart](/docs/quickstart/trace-analysis.md) provides a quick overview
on how to compute trace-based metrics using trace processor.

## Introduction

### Motivation

Performance metrics are useful to monitor the health of a system and ensure
that it does not regress over time as new features are added.

However, metrics retrieved directly from the system have a downside: if there is
a regression, it is difficult to root-cause the issue. Often, the problem may
not be reproducible or may rely on a particular setup.

Trace-based metrics are one possible solution to this problem. Instead of
collecting metrics directly on the system, a trace is collected and metrics are
computed from the trace. If a regression in the metric is spotted, the developer
can look directly at the trace to understand why the regression occurred
instead of having to reproduce the issue.

### Metric subsystem

The metric subsystem is a part of the
[trace processor](/docs/analysis/trace-processor.md) which executes SQL queries
against traces and produces a metric which summarizes some performance attribute
(e.g. CPU, memory, startup latency etc.).

For example, generating the Android CPU metrics on a trace is as simple as:

```python
> ./trace_processor --run-metrics android_cpu <trace>
android_cpu {
  process_info {
    name: "/system/bin/init"
    threads {
      name: "init"
      core {
        id: 1
        metrics {
          mcycles: 1
          runtime_ns: 570365
          min_freq_khz: 1900800
          max_freq_khz: 1900800
          avg_freq_khz: 1902017
        }
      }
      ...
    }
    ...
  }
  ...
}
```

## Metric development guide

As metric writing requires a lot of iteration to get right, there are several
tips which make the experience a lot smoother.

### Hot reloading metrics
To obtain the fastest possible iteration time when developing metrics,
it's possible to hot reload any changes to SQL; this will skip over both
recompilation (for builtin metrics) and trace load (for both builtin and
custom metrics).

To do this, start trace processor in *interactive mode* while still specifying
the command line flags for which metrics should be run and the paths of any
extensions. Then, in the REPL shell, use the commands `.load-metrics-sql`
(which causes any SQL on disk to be re-read) and `.run-metrics` (which runs
the metrics and prints the result).

For example, suppose we want to iterate on the `android_startup` metric. We
can run the following commands from a Perfetto checkout:
```python
> ./tools/trace_processor --interactive \
    --run-metrics android_startup \
    --metric-extension src/trace_processor/metrics@/ \
    --dev \
    <trace>
android_startup {
  <contents of startup metric>
}

# Now make any changes you want to the SQL files related to the startup
# metric. Even adding new files in src/trace_processor/metrics works.

# Then, we can reload the changes using `.load-metrics-sql`.
> .load-metrics-sql

# We can rerun the changed metric using `.run-metrics`.
> .run-metrics
android_startup {
  <contents of changed startup metric>
}
```

NOTE: see below for why `--dev` is required for this command.

This also works for custom metrics specified on the command line:
```python
> ./tools/trace_processor -i --run-metrics /tmp/my_custom_metric.sql <trace>
my_custom_metric {
  <contents of my_custom_metric>
}

# Change the SQL file as before.

> .load-metrics-sql
> .run-metrics
my_custom_metric {
  <contents of changed my_custom_metric>
}
```

WARNING: it is currently not possible to reload protos in the same way. If
protos are changed, a recompile (for built-in metrics) and re-invoking
trace processor are necessary to pick up the changes.

WARNING: files deleted from `--metric-extension` folders are *not* removed
and will remain available e.g. to `RUN_METRIC` invocations.

### Modifying built-in metric SQL without recompiling
It is possible to override the SQL of built-in metrics at runtime without
needing to recompile trace processor. To do this, the flag `--metric-extension`
needs to be specified with the disk path where the built-in metrics live and
the special string `/` for the virtual path.

For example, from inside a Perfetto checkout:
```python
> ./tools/trace_processor \
    --run-metrics android_cpu \
    --metric-extension src/trace_processor/metrics@/ \
    --dev \
    <trace>
```
This will run the CPU metric using the live SQL in the repo, *not* the SQL
definition built into the binary.

NOTE: protos are *not* overridden in the same way - if any proto messages are
changed, a recompile of trace processor is required for the changes to be
available.

NOTE: the `--dev` flag is required for the use of this feature. This flag
ensures that this feature is not accidentally used in production as it is only
intended for local development.

## Metric helper functions

There are several useful helper functions available when writing a metric.

### CREATE_FUNCTION
`CREATE_FUNCTION` allows you to define a parameterized SQL statement which
is executable as a function.
The inspiration for this function is the
`CREATE FUNCTION` syntax available in other SQL engines (e.g.
[Postgres](https://www.postgresql.org/docs/current/sql-createfunction.html)).

NOTE: CREATE_FUNCTION only supports returning *exactly* a single value (i.e.
a single row with a single column). For returning a single row with multiple
columns or multiple rows, see `CREATE_VIEW_FUNCTION` instead.

Usage of `CREATE_FUNCTION` is as follows:
```sql
-- First, we define the function we'll use in the following statement.
SELECT CREATE_FUNCTION(
  -- First argument: prototype of the function; this is very similar to
  -- function definitions in other languages - you set the function name
  -- (IS_TS_IN_RANGE in this example) and the arguments
  -- (ts, begin_ts and end_ts) along with their types (LONG for all
  -- arguments here).
  'IS_TS_IN_RANGE(ts LONG, begin_ts LONG, end_ts LONG)',
  -- Second argument: the return type of the function. Only single values
  -- can be returned in CREATE_FUNCTION. See CREATE_VIEW_FUNCTION for defining
  -- a function returning multiple rows/columns.
  'BOOL',
  -- Third argument: the SQL body of the function. This should always be a
  -- SELECT statement (even if you're not selecting from a table as in this
  -- example). Arguments can be accessed by prefixing argument names
  -- with $ (e.g. $ts, $begin_ts, $end_ts).
  'SELECT $ts >= $begin_ts AND $ts <= $end_ts'
);

-- Now we can actually use the function in queries as if it was any other
-- function.

-- For example, it can appear in the SELECT to produce a column:
SELECT ts, IS_TS_IN_RANGE(slice.ts, 100000, 200000) AS in_range
FROM slice;

-- It can also appear in a WHERE clause:
SELECT ts
FROM counter
WHERE IS_TS_IN_RANGE(counter.ts, 100000, 200000);

-- It can even appear in a join ON clause:
SELECT slice.ts
FROM launches
JOIN slice ON IS_TS_IN_RANGE(slice.ts, launches.ts, launches.end_ts);
```

### CREATE_VIEW_FUNCTION
Similar to `CREATE_FUNCTION`, `CREATE_VIEW_FUNCTION` can be used to
define a parameterized SQL statement which returns multiple rows and
columns. Such functions can be invoked anywhere a table or view can
be used.

Usage of `CREATE_VIEW_FUNCTION` is as follows:
```sql
-- First, we define the view function we'll use in the following
-- statement.
SELECT CREATE_VIEW_FUNCTION(
  -- First argument: prototype of the function; this is very similar to
  -- function definitions in other languages - you set the function name
  -- (NAMED_SLICE_IN_RANGE in this example) and the arguments
  -- (name_glob, begin_ts and end_ts) along with their types (STRING and
  -- LONG for the arguments here).
  'NAMED_SLICE_IN_RANGE(name_glob STRING, begin_ts LONG, end_ts LONG)',
  -- Second argument: the columns returned by the function; both the
  -- name and type are specified.
  'id INT, ts LONG, ts_end LONG, name STRING, track_id STRING',
  -- Third argument: the SQL body of the function. This should always be a
  -- SELECT statement. Arguments can be accessed by prefixing argument names
  -- with $ (e.g. $name_glob, $begin_ts, $end_ts).
  'SELECT id, ts, ts + dur AS ts_end, name, track_id
   FROM slice
   WHERE name GLOB $name_glob AND ts >= $begin_ts AND ts <= $end_ts'
);

-- Now we can use the function in queries in the same way we'd query a
-- table or view.
SELECT id, ts, ts_end, name
FROM NAMED_SLICE_IN_RANGE('launching:*', 12345, 67890);

-- This is exactly equivalent to the query:
SELECT id, ts, ts + dur AS ts_end, name
FROM slice
WHERE name GLOB 'launching:*' AND ts >= 12345 AND ts <= 67890;

-- View functions can also be used in joins.
SELECT sl.id, sl.ts, sl.ts_end, sl.name
FROM trace_bounds AS b
JOIN NAMED_SLICE_IN_RANGE('launching:*', b.start_ts, b.end_ts) AS sl;
```

### RUN_METRIC
`RUN_METRIC` allows you to run another metric file. This lets you use views
or tables defined in that file without repetition.

Conceptually, `RUN_METRIC` adds *composability* to SQL queries, letting you
break a big SQL metric into smaller, reusable files. This is similar to how
functions allow decomposing large chunks of code in traditional programming
languages.

A simple usage of `RUN_METRIC` would be as follows:

In file android/foo.sql:
```sql
CREATE VIEW view_defined_in_foo AS
SELECT *
FROM slice
LIMIT 1;
```

In file android/bar.sql:
```sql
SELECT RUN_METRIC('android/foo.sql');

CREATE VIEW view_depending_on_view_from_foo AS
SELECT *
FROM view_defined_in_foo
LIMIT 1;
```

`RUN_METRIC` also supports running *templated* metric files.
Here's an example of
what that looks like:

In file android/slice_template.sql:
```sql
CREATE VIEW {{view_name}} AS
SELECT *
FROM slice
WHERE slice.name = '{{slice_name}}';
```

In file android/metric.sql:
```sql
SELECT RUN_METRIC(
  'android/slice_template.sql',
  'view_name', 'choreographer_slices',
  'slice_name', 'Choreographer#doFrame'
);

CREATE VIEW long_choreographer_slices AS
SELECT *
FROM choreographer_slices
WHERE dur > 1e6;
```

When running `slice_template.sql`, trace processor will substitute the arguments
passed to `RUN_METRIC` into the templated file *before* executing the file using
SQLite.

In other words, this is what SQLite sees and executes in practice for the above
example:
```sql
CREATE VIEW choreographer_slices AS
SELECT *
FROM slice
WHERE slice.name = 'Choreographer#doFrame';

CREATE VIEW long_choreographer_slices AS
SELECT *
FROM choreographer_slices
WHERE dur > 1e6;
```

The syntax for templated metric files is essentially a highly simplified
version of [Jinja's](https://jinja.palletsprojects.com/en/3.0.x/) syntax.

## Walkthrough: prototyping a metric

TIP: To see how to add a new metric to trace processor, see the checklist
[here](/docs/contributing/common-tasks.md#new-metric)

This walkthrough will outline how to prototype a metric locally without needing
to compile trace processor. The metric will compute the CPU time for every
process in the trace and list the names of the top 5 processes (by CPU time)
and the number of threads created by each process.

NOTE: See this [GitHub gist][gist] to see how the code should look at the end of
      the walkthrough. The prerequisites and Step 4 below give instructions on
      how to get trace processor and run the metrics code.
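As an aside, the `{{...}}` template substitution that `RUN_METRIC` performs
(described in the previous section) is conceptually just string replacement.
The sketch below illustrates the idea in Python; it is illustrative only, not
trace processor's actual templating engine, and `expand_template` is a made-up
name:

```python
import re

# Illustrative only: a toy version of the {{key}} substitution that
# RUN_METRIC performs on templated metric files before handing the
# SQL to SQLite.
def expand_template(sql: str, args: dict) -> str:
    # Replace each {{key}} placeholder with the corresponding value.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: args[m.group(1)], sql)

template = (
    "CREATE VIEW {{view_name}} AS\n"
    "SELECT * FROM slice WHERE slice.name = '{{slice_name}}';"
)

expanded = expand_template(template, {
    "view_name": "choreographer_slices",
    "slice_name": "Choreographer#doFrame",
})
print(expanded)
```

The result is exactly the expanded SQL shown in the RUN_METRIC section: the
placeholders are gone before SQLite ever sees the file.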

[gist]: https://gist.github.com/LalitMaganti/c221cf0cae17e298dfa82b118edf9080

### Prerequisites

As a setup step, create a folder to act as a scratch workspace; this folder
will be referred to using the env variable `$WORKSPACE` in Step 4.

The other requirement is trace processor. This can be downloaded from
[here](https://get.perfetto.dev/trace_processor) or built from source using
the instructions [here](trace-processor.md). Whichever method is chosen, the
`$TRACE_PROCESSOR` env variable will be used to refer to the location of the
binary in Step 4.

### Step 1

As all metrics in the metrics platform are defined using protos, the metric
needs to be structured as a proto. For this metric, there needs to be some
notion of a process name along with its CPU time and number of threads.

Starting off, in a file named `top_five_processes.proto` in our workspace,
create a basic proto message called ProcessInfo with those three fields:

```protobuf
message ProcessInfo {
  optional string process_name = 1;
  optional int64 cpu_time_ms = 2;
  optional uint32 num_threads = 3;
}
```

Next, create a wrapping message which will hold the repeated field containing
the top 5 processes.

```protobuf
message TopProcesses {
  repeated ProcessInfo process_info = 1;
}
```

Finally, define an extension to the root proto for all metrics (the
[TraceMetrics](https://android.googlesource.com/platform/external/perfetto/+/HEAD/protos/perfetto/metrics/metrics.proto#39)
proto).

```protobuf
extend TraceMetrics {
  optional TopProcesses top_five_processes = 450;
}
```

Adding this extension field allows trace processor to link the newly defined
metric to the `TraceMetrics` proto.

_Notes:_

- The field ids 450-500 are reserved for local development so any of them can
  be used as the field id for the extension field.
- The choice of field name here is important as the SQL file and the final
  table generated in SQL will be based on this name.

Putting everything together, along with some boilerplate preamble, gives:

```protobuf
syntax = "proto2";

package perfetto.protos;

import "protos/perfetto/metrics/metrics.proto";

message ProcessInfo {
  optional string process_name = 1;
  optional int64 cpu_time_ms = 2;
  optional uint32 num_threads = 3;
}

message TopProcesses {
  repeated ProcessInfo process_info = 1;
}

extend TraceMetrics {
  optional TopProcesses top_five_processes = 450;
}
```

### Step 2

Next, write the SQL to generate the table of the top 5 processes ordered by the
sum of the CPU time they ran for, along with the number of threads associated
with each process.

The following SQL should be added to a file called `top_five_processes.sql` in
the workspace:

```sql
CREATE VIEW top_five_processes_by_cpu AS
SELECT
  process.name AS process_name,
  CAST(SUM(sched.dur) / 1e6 AS INT64) AS cpu_time_ms,
  COUNT(DISTINCT utid) AS num_threads
FROM sched
INNER JOIN thread USING(utid)
INNER JOIN process USING(upid)
GROUP BY process.name
ORDER BY cpu_time_ms DESC
LIMIT 5;
```

Let's break this query down:

1. The first table used is the `sched` table. This contains all the scheduling
   data available in the trace. Each scheduling "slice" is associated with a
   thread, which is uniquely identified in Perfetto traces by its `utid`. The
   two pieces of information needed from the sched table are the `dur` -
   short for duration, this is the amount of time the slice lasted - and the
   `utid`, which will be used to join with the thread table.
2. The next table is the thread table. This gives us a lot of information which
   is not particularly interesting (including the thread name) but it does give
   us the `upid`.
   Similar to `utid`, `upid` is the unique identifier for a
   process in a Perfetto trace. In this case, `upid` will refer to the process
   which hosts the thread given by `utid`.
3. The final table is the process table. This gives the name of the process
   associated with the original sched slice.
4. With the process, thread and duration for each sched slice, all the slices
   for a single process are collected and their durations summed to get the
   CPU time (dividing by 1e6 as sched's duration is in nanoseconds), along
   with the number of distinct threads.
5. Finally, we order by the CPU time and limit to the top 5 results.

### Step 3

Now that the result of the metric has been expressed as an SQL table, it needs
to be converted to a proto. The metrics platform has built-in support for
emitting protos using SQL functions; this is used extensively in this step.

Let's look at how it works for our table above.

```sql
CREATE VIEW top_five_processes_output AS
SELECT TopProcesses(
  'process_info', (
    SELECT RepeatedField(
      ProcessInfo(
        'process_name', process_name,
        'cpu_time_ms', cpu_time_ms,
        'num_threads', num_threads
      )
    )
    FROM top_five_processes_by_cpu
  )
);
```

Breaking this down again:

1. Starting from the inner-most SELECT statement, there is what looks like
   a function call to the ProcessInfo function; in fact, this is no
   coincidence. For each proto that the metrics platform knows about, an SQL
   function is generated with the same name as the proto. This function takes
   key-value pairs, with the key being the name of the proto field to fill and
   the value being the data to store in the field. The output is the proto
   created by writing the fields described in the function. (\*)

   In this case, this function is called once for each row in the
   `top_five_processes_by_cpu` table.
   The output will be the fully filled
   ProcessInfo proto.

   The call to the `RepeatedField` function is the most interesting part and
   also the most important. In technical terms, `RepeatedField` is an aggregate
   function. Practically, this means that it takes a full table of values and
   generates a single array which contains all the values passed to it.

   Therefore, the output of this whole SELECT statement is an array of 5
   ProcessInfo protos.

2. Next is the creation of the `TopProcesses` proto. By now, the syntax should
   already feel somewhat familiar; the proto builder function is called to fill
   in the `process_info` field with the array of protos from the inner
   function.

   The output of this SELECT is a single `TopProcesses` proto containing the
   ProcessInfos as a repeated field.

3. Finally, the view is created. This view is specially named to allow the
   metrics platform to query it to obtain the root proto for each metric
   (in this case `TopProcesses`). See the note below for the pattern behind
   this view's name.

(\*) _This is not strictly true. To type-check the protos, some metadata
is returned about the type of the proto but this is unimportant for metric
authors._

NOTE: It is important that the views be named {name of TraceMetrics extension
      field}\_output. This is the pattern used and expected by the metrics
      platform for all metrics.
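The behaviour of `RepeatedField` as an aggregate can be mimicked with a
user-defined SQLite aggregate. The sketch below, using Python's `sqlite3`
module on a toy table, is an analogy rather than trace processor's actual
implementation (`COLLECT_LIST` and the sample data are made up): it collects
one value per input row into a single array, just as `RepeatedField` collects
one `ProcessInfo` proto per row:

```python
import json
import sqlite3

# Illustrative only: a user-defined aggregate that gathers all input
# values into one JSON array, loosely analogous to how RepeatedField
# folds one proto per row into a single repeated field.
class CollectList:
    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per input row, like RepeatedField.
        self.values.append(value)

    def finalize(self):
        # Returns the single aggregated result.
        return json.dumps(self.values)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("COLLECT_LIST", 1, CollectList)
conn.executescript("""
  CREATE TABLE top_five_processes_by_cpu(process_name TEXT, cpu_time_ms INT);
  INSERT INTO top_five_processes_by_cpu VALUES
    ('com.google.android.GoogleCamera', 15154),
    ('system_server', 6809);
""")
collected = conn.execute(
    "SELECT COLLECT_LIST(process_name) FROM top_five_processes_by_cpu"
).fetchone()[0]
print(collected)
```

The whole table collapses into one value, which is why the surrounding
`SELECT RepeatedField(...)` in the metric returns a single array rather than
one row per process.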

The final file should look like so:

```sql
CREATE VIEW top_five_processes_by_cpu AS
SELECT
  process.name AS process_name,
  CAST(SUM(sched.dur) / 1e6 AS INT64) AS cpu_time_ms,
  COUNT(DISTINCT utid) AS num_threads
FROM sched
INNER JOIN thread USING(utid)
INNER JOIN process USING(upid)
GROUP BY process.name
ORDER BY cpu_time_ms DESC
LIMIT 5;

CREATE VIEW top_five_processes_output AS
SELECT TopProcesses(
  'process_info', (
    SELECT RepeatedField(
      ProcessInfo(
        'process_name', process_name,
        'cpu_time_ms', cpu_time_ms,
        'num_threads', num_threads
      )
    )
    FROM top_five_processes_by_cpu
  )
);
```

NOTE: The name of the SQL file should be the same as the name of the
      TraceMetrics extension field. This allows the metrics platform to
      associate the proto extension field with the SQL which needs to be run
      to generate it.

### Step 4

For this step, invoke trace processor shell to run the metrics (see the
[Quickstart](/docs/quickstart/trace-analysis.md) for downloading instructions):

```shell
$TRACE_PROCESSOR --run-metrics $WORKSPACE/top_five_processes.sql $TRACE 2> /dev/null
```

(For an example trace to test this on, see the Notes section below.)

From the SQL file passed for the metric, trace processor uses the file's name
to find the proto, the name of the output table for the proto, and the name of
the extension field for `TraceMetrics`; this is why it was important to choose
the names of these other objects carefully.

_Notes:_

- If something doesn't work as intended, check that the workspace looks the
  same as the contents of this
  [GitHub gist](https://gist.github.com/LalitMaganti/c221cf0cae17e298dfa82b118edf9080).
- A good example trace for this metric is the Android example trace used by
  the Perfetto UI, found
  [here](https://storage.googleapis.com/perfetto-misc/example_android_trace_30s_1).
- stderr is redirected to remove the noise that trace processor generates
  while parsing the trace.

If everything went successfully, the following output should be visible
(specifically, this is the output for the Android example trace linked above):

```
[perfetto.protos.top_five_processes] {
  process_info {
    process_name: "com.google.android.GoogleCamera"
    cpu_time_ms: 15154
    num_threads: 125
  }
  process_info {
    process_name: "sugov:4"
    cpu_time_ms: 6846
    num_threads: 1
  }
  process_info {
    process_name: "system_server"
    cpu_time_ms: 6809
    num_threads: 66
  }
  process_info {
    process_name: "cds_ol_rx_threa"
    cpu_time_ms: 6684
    num_threads: 1
  }
  process_info {
    process_name: "com.android.chrome"
    cpu_time_ms: 5125
    num_threads: 49
  }
}
```

## Next steps

* The [common tasks](/docs/contributing/common-tasks.md) page gives a list of
  steps on how new metrics can be added to the trace processor.

## Appendix: Case for upstreaming

NOTE: Googlers: for internal usage of metrics in Google3 (i.e. metrics which
are confidential), please see [this internal page](https://goto.google.com/viecd).

Authors are strongly encouraged to add all metrics derived from Perfetto
traces to the Perfetto repo unless there is a clear reason (e.g.
confidentiality) why these metrics should not be publicly available.

In return for upstreaming metrics, authors will have first-class support for
running metrics locally and the confidence that their metrics will remain
stable as trace processor is developed.

As well as scaling upwards while developing, from running on a single trace
locally to running on a large set of traces, the reverse is also very useful.
When an anomaly is observed in the metrics of a lab benchmark, a
representative trace can be downloaded and the same metric can be run locally
in trace processor.

Since the same code is running locally and remotely, developers can be
confident in reproducing the issue and can use trace processor and/or the
Perfetto UI to identify the problem.