AutoFDO and ARM Trace {#AutoFDO}
=====================

@brief Using CoreSight trace and perf with OpenCSD for AutoFDO.

## Introduction

Feedback directed optimization (FDO, also known as profile guided
optimization - PGO) uses a profile of a program's execution to guide the
optimizations performed by the compiler. Traditionally, this involves
building an instrumented version of the program, which records a profile of
execution as it runs. The instrumentation adds significant runtime
overhead, possibly changing the behaviour of the program, and it may not be
possible to run the instrumented program in a production environment
(e.g. where performance criteria must be met).

AutoFDO uses facilities in the hardware to sample the behaviour of the
program in the production environment and generate the execution profile.
An improved profile can be obtained by including the branch history
(i.e. a record of the last branches taken) when generating instruction
samples. On Arm systems, the ETM can be used to generate such records.

The process can be broken down into the following steps:

* Record execution trace of the program
* Convert the execution trace to instruction samples with branch histories
* Convert the instruction samples to source level profiles
* Use the source level profile with the compiler

This article describes how to enable ETM trace on Arm targets running Linux
and use the ETM trace to generate AutoFDO profiles and compile an optimized
program.


## Execution trace on Arm targets

Debug and trace of Arm targets is provided by CoreSight. This consists of
a set of components that allow access to debug logic, record (trace) the
execution of a processor and route this data through the system, collecting
it into a store.

To record the execution of a processor, we require the following
components:

* A trace source.
  The core contains a trace unit, called an ETM, that emits
  data describing the instructions executed by the core.
* Trace links. The trace data generated by the ETM must be moved through
  the system to the component that collects the data (sink). Links
  include:
  * Funnels: merge multiple streams of data
  * FIFOs: buffer data to smooth out bursts
  * Replicators: send a stream of data to multiple components
* Sinks. These receive the trace data and store it or send it to an
  external device:
  * ETB: A small circular buffer (64-128 kilobytes) that stores the most
    recent data
  * ETR: A larger (several megabytes) buffer that uses system RAM to
    store data
  * TPIU: Sends data to an off-chip capture device (e.g. Arm DSTREAM)

Each Arm SoC design may have a different layout (topology) of components.
This topology is described to the OS drivers by the platform's devicetree
or (in future) ACPI firmware.

For application profiling, we need to store several megabytes of data
within the system, so will use ETR with the capture tool (perf)
periodically draining the buffer to a file.

Even though we have a large capture buffer, the ETM can still generate a
lot of data very quickly - typically an ETM will generate ~1 bit of data
per instruction (depending on the workload), which results in roughly
250Mbytes per second for a core running at 2GHz and retiring about one
instruction per cycle. This leads to problems storing and
decoding such large volumes of data. AutoFDO uses samples of program
execution, so we can avoid this problem by using the ETM's features to
only record small slices of execution - e.g. collect ~5000 cycles of data
every 50M cycles. This reduces the data rate to a manageable level - a few
megabytes per minute. This technique is known as 'strobing'.


## Enabling trace

### Driver support

To collect ETM trace, the CoreSight drivers must be included in the
kernel.
Some of the driver support is not yet included in the mainline
kernel and many targets are using older kernels. To enable CoreSight trace
on these targets, Arm have provided backports of the latest CoreSight
drivers and ETM strobing patch at:

  <https://gitlab.arm.com/linux-arm/linux-coresight-backports>

This repository can be cloned with:

```
git clone https://git.gitlab.arm.com/linux-arm/linux-coresight-backports.git
```

You can include these backports in your kernel by either merging the
appropriate branch using git or generating patches (using `git
format-patch`).

For 5.x based kernels onwards, the only patch that needs to be applied is
the one enabling strobing: `etm4x: Enable strobing of ETM`.

For 4.9 based kernels, use the `coresight-4.9-etr-etm_strobe` branch:

```
git merge coresight-4.9-etr-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.9..coresight-4.9-etr-etm_strobe
cd my_kernel
git am /output/dir/*.patch # or: cat /output/dir/*.patch | patch -p1, if not using git
```

For 4.14 based kernels, use the `coresight-4.14-etm_strobe` branch:

```
git merge coresight-4.14-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.14..coresight-4.14-etm_strobe
cd my_kernel
git am /output/dir/*.patch # or: cat /output/dir/*.patch | patch -p1, if not using git
```

The CoreSight trace drivers must also be enabled in the kernel
configuration.
This can be done using the configuration menu (`make
menuconfig`), selecting `Kernel hacking` / `arm64 Debugging` / `CoreSight Tracing Support` and
enabling all options, or by setting the following in the configuration
file:

```
CONFIG_CORESIGHT=y
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
CONFIG_CORESIGHT_SINK_TPIU=y
CONFIG_CORESIGHT_SOURCE_ETM4X=y
CONFIG_CORESIGHT_DYNAMIC_REPLICATOR=y
CONFIG_CORESIGHT_STM=y
CONFIG_CORESIGHT_CATU=y
```

Compile the kernel for your target in the usual way, e.g.

```
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

Each target may have a different layout of CoreSight components. To
collect trace into a sink, the kernel drivers need to know which other
devices need to be configured to route data from the source to the sink.
This is described in the devicetree (and in future, the ACPI tables). The
devicetree will define which CoreSight devices are present in the system,
where they are located and how they are connected together. The devicetree
for some platforms includes a description of the platform's CoreSight
components, but in other cases you may have to ask the platform/SoC vendor
to supply it or create it yourself (see Appendix: Describing CoreSight in
Devicetree).

Once the target has been booted with the devicetree describing the
CoreSight devices, you should find the devices in sysfs:

```
# ls /sys/bus/coresight/devices/
etm0 etm2 etm4 etm6 funnel0 funnel2 funnel4 stm0 tmc_etr0
etm1 etm3 etm5 etm7 funnel1 funnel3 replicator0 tmc_etf0
```

The naming convention for etm devices can be different according to the kernel version you're using.
For more information about the naming scheme, see the [Linux Kernel Documentation](https://www.kernel.org/doc/html/latest/trace/coresight/coresight.html#device-naming-scheme).

If `/sys/bus/coresight/devices/` is empty, check your kernel configuration to make sure the `.config` file includes the CoreSight dependencies, such as the clock.

### Perf tools

The perf tool is used to capture execution trace, configuring the trace
sources to generate trace, routing the data to the sink and collecting the
data from the sink.

Arm recommends using the perf version corresponding to the kernel running
on the target. This can be built from the same kernel sources with:

```
make -C tools/perf CORESIGHT=1 VF=1 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

When specifying CORESIGHT=1, perf will be built using the installed OpenCSD library.
If you are cross compiling, then additional setup is required to ensure the build process links against the correct version of the library.

If the post-processing (`perf inject`) of the captured data is not being
done on the target, then the OpenCSD library is not required for this build
of perf.

Trace is captured by collecting the `cs_etm` event from perf. The sink
to collect data into is specified as a parameter of this event. Trace can
also be restricted to user space or kernel space with the 'u' or 'k'
parameters. For example:

```
perf record -e cs_etm/@tmc_etr0/u --per-thread -- /bin/ls
```

This records the userspace execution of `/bin/ls` using `tmc_etr0` as the sink.

## Capturing modes

You can trace a single-threaded program in two different ways:

1. With `--per-thread`: the CoreSight subsystem records only the trace
   of the given program.

2. Without `--per-thread`: CPU-wide tracing is
   enabled.
   In this scenario the trace will contain both the target program trace
   and other workloads that were executing on the same CPU.



## Processing trace and profiles

perf is also used to convert the execution trace to an instruction profile.
This requires a different build of perf, using the version of perf from
Linux v4.17 or later, as the trace processing code isn't included in the
driver backports. Trace decode is provided by the OpenCSD library
(<https://github.com/Linaro/OpenCSD>), v0.9.1 or later. This is packaged
for Debian testing (install the `libopencsd0` and `libopencsd-dev` packages) or
can be compiled from source and installed.

The AutoFDO tool (<https://github.com/google/autofdo>) is used to convert the
instruction profiles to source profiles for the GCC and clang/llvm
compilers.


## Recording and profiling

Once trace collection using perf is working, we can use it to profile
an application.

The application must be compiled to include sufficient debug information to
map instructions back to source lines. For GCC, use the `-g1` or `-gmlt`
options. For clang/llvm, also add the `-fdebug-info-for-profiling` option.

perf identifies the active program or library using the build identifier
stored in the ELF file. This should be added at link time with the compiler
flag `-Wl,--build-id=sha1`.

The next step is to record the execution trace of the application using the
perf tool. The ETM strobing should be configured before running the perf
tool. There are two parameters:

 * window size: A number of CPU cycles (W)
 * period: Trace is enabled for W cycles every _period_ * W cycles.

For example, a typical configuration is to use a window size of 5000 cycles
and a period of 10000 - this will collect 5000 cycles of trace every 50M
cycles.
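The strobing figures can be checked with a little shell arithmetic. This is only a sketch: the 2GHz clock and ~1 bit of trace per instruction come from the estimate earlier in this article, while the one-instruction-per-cycle rate and the variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Strobing arithmetic using the figures quoted in this article.
window=5000          # cycles traced per strobe window (W)
period=10000         # trace is on for W cycles every period * W cycles
cpu_hz=2000000000    # 2GHz core (assumption: about one instruction per cycle)
bits_per_insn=1      # approximate ETM output per instruction

interval=$((window * period))              # cycles from one window to the next
full_rate=$((cpu_hz * bits_per_insn / 8))  # bytes/s with trace always enabled
strobed_rate=$((full_rate / period))       # bytes/s with strobing (1/period duty)
per_minute=$((strobed_rate * 60))          # bytes of trace per minute

echo "one window every $interval cycles"
echo "unstrobed: $full_rate bytes/s"
echo "strobed:   $strobed_rate bytes/s ($per_minute bytes/min)"
```

With these inputs the script reports one window every 50000000 cycles and about 1.5 million bytes of trace per minute, which is consistent with the "few megabytes per minute" estimate and close to the unstrobed figure quoted earlier.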
With these proof-of-concept patches, the strobe parameters are
configured via sysfs - each ETM will have `strobe_window` and
`strobe_period` parameters in `/sys/bus/coresight/devices/<etm>` and
these values will have to be written to each ETM (in a future version, this
will be integrated into the drivers and perf tool).
The `set_strobing.sh` script in this directory [`<opencsd>/decoder/tests/auto-fdo`] automates this process.

To collect trace from an application using ETM strobing, run:

```
sudo ./set_strobing.sh 5000 10000
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>
```

The raw trace can be examined using the `perf report` command:

```
perf report -D -i perf.data --stdio
```

To read and post-process ETM-gathered samples, perf needs to be built from the source tree of your Linux kernel version against the OpenCSD library.
If running `perf report` produces an error like:

```
0x1f8 [0x268]: failed to process type: 70 [Operation not permitted]
Error:
failed to process sample
```
or

```
"file uses a more recent and unsupported ABI (8 bytes extra). incompatible file format".
```

then your perf build is probably not using the OpenCSD library. Install OpenCSD, either by compiling v0.9.1 or later from [source](https://github.com/Linaro/OpenCSD) or from the Debian packages (`libopencsd0`, `libopencsd-dev`), and rebuild perf against it.


An example of the raw trace output:

```
0x1d370 [0x30]: PERF_RECORD_AUXTRACE size: 0x2003c0 offset: 0 ref: 0x39ba881d145f8639 idx: 0 tid: 4551 cpu: -1

. ... CoreSight ETM Trace data: size 2098112 bytes
  Idx:0; ID:12; I_ASYNC : Alignment Synchronisation.
  Idx:12; ID:12; I_TRACE_INFO : Trace Info.; INFO=0x0
  Idx:17; ID:12; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
  Idx:48; ID:14; I_ASYNC : Alignment Synchronisation.
  Idx:60; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
  Idx:65; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
  Idx:96; ID:14; I_ASYNC : Alignment Synchronisation.
  Idx:108; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
  Idx:113; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
  Idx:122; ID:14; I_TRACE_ON : Trace On.
  Idx:123; ID:14; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000000000407B00; Ctxt: AArch64,EL0, NS;
  Idx:134; ID:14; I_ATOM_F3 : Atom format 3.; ENN
  Idx:135; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
  Idx:136; ID:14; I_ATOM_F5 : Atom format 5.; ENENE
  Idx:137; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
  Idx:138; ID:14; I_ATOM_F3 : Atom format 3.; ENN
  Idx:139; ID:14; I_ATOM_F3 : Atom format 3.; NNE
  Idx:140; ID:14; I_ATOM_F1 : Atom format 1.; E
.....
```

The execution trace is then converted to an instruction profile using
the perf build with trace decode support. This may be done on a different
machine than that which collected the trace (e.g. when cross compiling for
an embedded target).

Note: if you decode on a different machine from the one used to collect the
profiling data, you'll need to run `perf buildid-cache` as described below.

The `perf inject` command
decodes the execution trace and generates periodic instruction samples,
with branch histories:
```
perf inject -i perf.data -o inj.data --itrace=i100000il
```

The `--itrace` option configures the instruction sample behaviour:

* `i100000i` generates an instruction sample every 100000 instructions
  (only instruction count periods are currently supported, future versions
  may support time or cycle count periods)
* `l` includes the branch histories on each sample
* `b` generates a sample on each branch (not used here)

Perf requires the original program binaries to decode the execution trace.
If running the `inject` command on a different system than the trace was
captured on, then the binary and any shared libraries must be added to
perf's cache with:

```
perf buildid-cache -a /path/to/binary_or_library
```

`perf report` can also be used to show the instruction samples:

```
perf report -D -i inj.data --stdio
.......
0x1528 [0x630]: PERF_RECORD_SAMPLE(IP, 0x2): 4551/4551: 0x434b98 period: 3093 addr: 0
... branch stack: nr:64
.....  0: 0000000000434b58 -> 0000000000434b68 0 cycles P 0
.....  1: 0000000000436a88 -> 0000000000434b4c 0 cycles P 0
.....  2: 0000000000436a64 -> 0000000000436a78 0 cycles P 0
.....  3: 00000000004369d0 -> 0000000000436a60 0 cycles P 0
.....  4: 000000000043693c -> 00000000004369cc 0 cycles P 0
.....  5: 00000000004368a8 -> 0000000000436928 0 cycles P 0
.....  6: 000000000042d070 -> 00000000004368a8 0 cycles P 0
.....  7: 000000000042d108 -> 000000000042d070 0 cycles P 0
.......
..... 57: 0000000000448ee0 -> 0000000000448f24 0 cycles P 0
..... 58: 0000000000448ea4 -> 0000000000448ebc 0 cycles P 0
..... 59: 0000000000448e20 -> 0000000000448e94 0 cycles P 0
..... 60: 0000000000448da8 -> 0000000000448ddc 0 cycles P 0
..... 61: 00000000004486f4 -> 0000000000448da8 0 cycles P 0
..... 62: 00000000004480fc -> 00000000004486d4 0 cycles P 0
..... 63: 0000000000448658 -> 00000000004480ec 0 cycles P 0
 ...
thread: program1:4551
 ...... dso: /home/root/program1
.......
```

The instruction samples produced by `perf inject` are then passed to the
autofdo tool to generate source level profiles for the compiler. For
clang/LLVM:

```
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
```

And for GCC:

```
create_gcov -binary=/path/to/binary -profile=inj.data -gcov_version=1 -gcov=program.gcov
```

The profiles can be viewed with:

```
llvm-profdata show -sample program.llvmprof
```

Or, for GCC:

```
dump_gcov -gcov_version=1 program.gcov
```

## Using profile in the compiler

The profile produced by the above steps can then be passed to the compiler
to optimize the next build of the program.

For GCC, use the `-fauto-profile` option:

```
gcc -O2 -fauto-profile=program.gcov -o program program.c
```

For Clang, use the `-fprofile-sample-use` option:

```
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
```


## Summary

The basic commands to run an application and create a compiler profile are:

```
sudo ./set_strobing.sh 5000 10000
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>
perf inject -i perf.data -o inj.data --itrace=i100000il
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
```

Use `create_gcov` for gcc.

## High Level Summary for recording on Arm board and decoding on different host

1. (on Arm board)

        sudo ./set_strobing.sh 5000 10000
        perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>

   If you specify `-N, --no-buildid-cache`, perf will only record the target binary and nothing will be copied.<br> If you don't specify it, any recorded dynamic library will be copied to `~/.debug` on the board.

2. (on Arm board) Run `perf archive`, which saves all the libraries it finds into a tar file (internally, it reads the `perf.data` file and performs a lookup using `perf buildid-list --with-hits`).
3. (on host) Use `scp` to copy `perf.data` and the tar file generated by `perf archive` from the board.
4. (on host) Run `tar xvf perf.data.tar.bz2 -C ~/.debug` to populate the buildid-cache.
5. (on host) Double check the setup is correct:

   a. `perf buildid-list -i perf.data` lists the buildids of the binaries and dynamic libraries whose trace has been recorded and saved in `perf.data`.
   b. `perf buildid-cache --list` lists the binaries and dynamic libraries in the buildid-cache that will be used by `perf inject`.
   Make sure the outputs of (a) and (b) contain the same buildid values for the binaries you are interested in optimizing with afdo.

6. (on host) `perf inject -i perf.data -o inj.data --itrace=i100000il` looks up the dynamic libraries in the buildid-cache by buildid and post-processes the trace.<br> The buildids have to match, otherwise it won't be possible to post-process the trace.

7. (on host) `create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof` takes the output from `perf inject` and transforms it into a format that the compiler can read.
8. (on host) `clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c` makes clang use the produced profile.<br>
   If you are confident that your profile is accurate, you can add the `-fprofile-sample-accurate` flag, which penalizes all the callsites without a corresponding profile by marking them as cold.

If you are using the same host for both building the binary to be traced and re-building it with afdo:

1.
   You won't need to copy back any dynamic libraries from the board (since you already have them), and can use `--no-buildid-cache` when recording.
2. You have to make sure the relevant dynamic libraries to be optimized are present in the buildid-cache.

You can add a dynamic library or binary to the buildid-cache manually by running:

`perf buildid-cache --add <path/to/library/or/binary> -vvv`

You can check what is currently contained in your buildid-cache by running:

`perf buildid-cache --list`

You can check the buildid of a given binary/dynamic library:

`file <path/to/dynamic/library>`

## References

* AutoFDO tool: <https://github.com/google/autofdo>
* GCC's wiki on autofdo: <https://gcc.gnu.org/wiki/AutoFDO>, <https://gcc.gnu.org/wiki/AutoFDO/Tutorial>
* Google paper: <https://ai.google/research/pubs/pub45290>
* CoreSight kernel docs: Documentation/trace/coresight.txt


## Appendix: Describing CoreSight in Devicetree


Each component has an entry in the devicetree that describes its:

* type: The `compatible` field defines which driver to use
* location: A `reg` defines the component's address and size on the bus
* clocks: The `clocks` and `clock-names` fields state which clock provides
  the `apb_pclk` clock.
* connections to other components: `port` and `ports` fields link the
  component to ports of other components

To create the devicetree, some information about the platform is required:

* The memory address of the CoreSight components. This is the address in
  the CPU's address space where the CPU can access each CoreSight
  component.
* The connections between the components.

This information can be found in the SoC's reference manual or you may need
to ask the platform/SoC vendor to supply it.

An ETMv4 source is declared with a section like this:

```
    etm0: etm@22040000 {
        compatible = "arm,coresight-etm4x", "arm,primecell";
        reg = <0 0x22040000 0 0x1000>;

        cpu = <&A72_0>;
        clocks = <&soc_smc50mhz>;
        clock-names = "apb_pclk";
        port {
            cluster0_etm0_out_port: endpoint {
                remote-endpoint = <&cluster0_funnel_in_port0>;
            };
        };
    };
```

This describes an ETMv4 attached to core A72_0, located at 0x22040000, with
its output linked to port 0 of a funnel. The funnel is described with:

```
    funnel@220c0000 { /* cluster0 funnel */
        compatible = "arm,coresight-funnel", "arm,primecell";
        reg = <0 0x220c0000 0 0x1000>;

        clocks = <&soc_smc50mhz>;
        clock-names = "apb_pclk";
        power-domains = <&scpi_devpd 0>;
        ports {
            #address-cells = <1>;
            #size-cells = <0>;

            port@0 {
                reg = <0>;
                cluster0_funnel_out_port: endpoint {
                    remote-endpoint = <&main_funnel_in_port0>;
                };
            };

            port@1 {
                reg = <0>;
                cluster0_funnel_in_port0: endpoint {
                    slave-mode;
                    remote-endpoint = <&cluster0_etm0_out_port>;
                };
            };

            port@2 {
                reg = <1>;
                cluster0_funnel_in_port1: endpoint {
                    slave-mode;
                    remote-endpoint = <&cluster0_etm1_out_port>;
                };
            };
        };
    };
```

This describes a funnel located at 0x220c0000, receiving data from 2 ETMs
and sending the merged data to another funnel.
We continue describing
components with similar blocks until we reach the sink (an ETR):

```
    etr@20070000 {
        compatible = "arm,coresight-tmc", "arm,primecell";
        reg = <0 0x20070000 0 0x1000>;
        iommus = <&smmu_etr 0>;

        clocks = <&soc_smc50mhz>;
        clock-names = "apb_pclk";
        power-domains = <&scpi_devpd 0>;
        port {
            etr_in_port: endpoint {
                slave-mode;
                remote-endpoint = <&replicator_out_port1>;
            };
        };
    };
```

Full descriptions of the properties of each component can be found in the
Linux source at Documentation/devicetree/bindings/arm/coresight.txt.
The Arm Juno platform's devicetree (arch/arm64/boot/dts/arm) provides an
example CoreSight description.

Many systems include a TPIU for off-chip trace. While this isn't required
for self-hosted trace, it should still be included in the devicetree. This
allows the drivers to access it to ensure it is put into a disabled state,
otherwise it may limit the trace bandwidth causing data loss.
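
After booting with a devicetree along these lines, one way to sanity-check it is to count the CoreSight devices the drivers registered in sysfs. The following is a minimal sketch, not part of OpenCSD or the kernel: `count_devices` and `report_coresight` are hypothetical helper names, and the name prefixes are assumptions taken from the example sysfs listing in this article (ETM device names in particular differ between kernel versions).

```shell
#!/bin/sh
# Count CoreSight devices of each kind visible in sysfs.
# The prefixes below are assumptions; adjust them to match your kernel.

count_devices() {
    # $1: sysfs directory to scan, $2: device name prefix
    n=0
    for d in "$1"/"$2"*; do
        # An unmatched glob stays literal, so check the entry really exists.
        [ -e "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

report_coresight() {
    # $1: optional directory, defaulting to the CoreSight bus in sysfs
    dir="${1:-/sys/bus/coresight/devices}"
    for prefix in etm funnel replicator tmc_etr tpiu; do
        echo "$prefix: $(count_devices "$dir" "$prefix")"
    done
}

report_coresight "$@"
```

A count of zero for the ETMs or the ETR suggests that the corresponding nodes are missing from the devicetree or that the driver was not enabled in the kernel configuration.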