# Collect LBR (x86 Architectures) data for AutoFDO

# Table of Contents

- Introduction
- AFDO Compiler Optimizations
  - Sampling Profiler
  - Execution Profiles
  - Limitations for Code Coverage
  - Generating Sampling Profiles
  - AFDO Flow Diagram
- Intel's Performance Monitoring Unit (PMU)
- Examples
  - A complete example: autofdo_inline_test.cpp
- Related docs

## Introduction

This user guide provides an overview of AFDO compiler
optimizations, details on Intel Performance Monitoring Units (PMU), and
instructions for collecting Last Branch Record (LBR) profiles on
x86 platforms.

## AFDO Compiler Optimizations

**AutoFDO compiler optimizations** are a set of techniques that
compilers use to improve the performance of software applications.
These optimizations are driven by profiles sampled from hardware
performance events such as `br_inst_retired.near_taken` and
`cpu-cycles`.

### Sampling Profiler

A sampling profiler can generate a performance profile with very low
runtime overhead. This profile is crucial for optimization purposes but
is not suitable for code coverage analysis. The profiler collects data
by periodically sampling the program's execution, which provides a
statistical representation of where time is being spent in the code.

### Execution Profiles

Compilers use **execution profiles** that consist of basic block and
edge frequency counts. These profiles guide various optimizations,
including:

- **Instruction Scheduling**: Reordering instructions to minimize
  delays and improve pipeline efficiency.
- **Basic Block Re-ordering**: Rearranging basic blocks to enhance
  cache performance and reduce branch mispredictions.
- **Function Splitting**: Dividing functions into hot and cold parts to
  improve inlining and reduce the hot code footprint.
- **Register Allocation**: Efficiently assigning variables to CPU
  registers to minimize memory access.

These optimizations aim to improve execution speed, reduce resource
consumption, and enhance the overall efficiency of applications on
specific hardware configurations.

### Limitations for Code Coverage

While it is technically possible to use sampling profiles for code
coverage, they are generally too coarse-grained for this purpose.
Sampling profiles provide a statistical view rather than a precise
execution trace, leading to poor results in code coverage analysis.

### Generating Sampling Profiles

Sampling profiles must be generated by an external tool, which is
simpleperf in this guide. Once generated, the profile needs to be
converted into a format that LLVM can read, using the
`create_llvm_prof` tool.

### AFDO Flow Diagram

## Intel's Performance Monitoring Unit (PMU)

Intel's Performance Monitoring Unit (PMU) is a hardware feature built
into Intel processors to measure various performance parameters. These
parameters include instruction cycles, cache hits, cache misses, branch
misses, and more. The PMU helps in understanding how effectively code
uses hardware resources and provides insights for optimization. The
Last Branch Record (LBR) is one of the performance monitoring features
that the PMU includes.

The Last Branch Record (LBR) is a CPU feature that logs the source and
destination addresses of recently executed branch instructions. This
capability is useful for performance monitoring and debugging, allowing
developers to track the control flow of their programs.
By analyzing the data captured
through LBR, we can gain insights into how applications
navigate their execution paths and pinpoint the areas where the
program spends most of its time, often referred to as "hot paths."

**Branch Statistics**: LBR can gather comprehensive branch statistics
in C++ programs. This data helps in understanding the behavior of
conditional decisions in the code.

**Virtual Calls**: LBR is particularly useful for analyzing the
outcomes of indirect branches and virtual calls, key components in
object-oriented programming that can significantly influence
performance.

LBR entries typically consist of `FROM_IP` and `TO_IP`, which denote
the source and destination addresses of the branching instructions.
This logging offers a clear view of the program's execution flow.

**Model Specific Registers (MSRs)**: The configuration of LBR relies on
Model Specific Registers (MSRs) specific to Intel CPUs. These registers
enable and manage LBR functionality:

- **IA32_DEBUGCTL**: To start LBR recording, set bit 0 of the
  IA32_DEBUGCTL register to 1.
- **MSR_LASTBRANCH_x_FROM_IP**: These registers store the source
  addresses of the most recent branch instructions.
- **MSR_LASTBRANCH_x_TO_IP**: These registers store the destination
  addresses of those most recent branches.

**Clearing LBRs**: LBR is cleared when the CPU enters low-power sleep
states deeper than C2. To preserve the recorded data, it may be
necessary to keep the CPU awake.
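
As a minimal sketch of the IA32_DEBUGCTL setting described above, the shell helpers below compute a register value with bit 0 (the LBR enable bit) set and check whether that bit is set. The helper names `debugctl_with_lbr` and `lbr_enabled` are hypothetical and not part of simpleperf. The MSR address 0x1D9 for IA32_DEBUGCTL and the `rdmsr`/`wrmsr` commands from msr-tools appear only in comments, as an assumption about how the value might be applied on a Linux host. In normal use the kernel perf subsystem programs these registers when `simpleperf record -b` runs, so manual MSR writes should not be needed.

```sh
# Hypothetical helper (not part of simpleperf): print the given
# IA32_DEBUGCTL value with bit 0 (LBR enable) set, in hex.
debugctl_with_lbr() {
    printf '0x%x\n' $(( $1 | 0x1 ))
}

# Hypothetical helper: succeed (exit status 0) if bit 0 is set in the value.
lbr_enabled() {
    [ $(( $1 & 0x1 )) -eq 1 ]
}

# Assumed usage on a Linux host with msr-tools installed (needs root):
#   cur="0x$(rdmsr 0x1d9)"                       # read IA32_DEBUGCTL
#   wrmsr 0x1d9 "$(debugctl_with_lbr "$cur")"    # write it back with LBR enabled
```

For example, `debugctl_with_lbr 0x4000` prints `0x4001`, leaving the other bits of the register value untouched.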

**Stopping LBR**: Stopping LBR recording can be difficult and might
require performance monitoring interrupts (PMIs), which adds
complexity to the management of this feature.

**Advantages**:

- **Overhead**: LBR has minimal overhead. It causes nearly zero
  performance degradation compared to traditional software-based branch
  recording methods, making it an efficient choice in
  performance-sensitive applications.
- **Accuracy**: Manual code instrumentation might yield slightly
  better precision in certain scenarios, but it comes at the cost of
  significantly higher runtime overhead, making LBR a more appealing
  alternative in many cases.
- **Scenarios**: LBR is particularly useful when the source code is not
  readily accessible or when the software build process is unknown. In
  such cases, LBR still provides insight into program behavior,
  allowing developers and analysts to make informed decisions based on
  the recorded execution paths.

Simpleperf supports collecting LBR data and converting it to input files
for AutoFDO, which can then be used for Feedback Directed Optimization
during compilation.

## Examples

Below are examples of collecting LBR data for AutoFDO. There are two
steps: first recording LBR data, then converting the LBR data to AutoFDO
input files.

Record LBR data:

``` sh
# Preparation: the device must be rooted to record LBR data.
# For initial setup:
$ adb root
$ adb remount
# The device will ask for a reboot for the changes to be applied.
# Once the initial setup is done, only the steps below are needed from then on.
$ adb root
$ adb shell
brya:/ # cd data/local/tmp
brya:/data/local/tmp #

# Record LBR data. The output is written to perf.data.
# For kernel-only LBR data, use `-e BR_INST_RETIRED.NEAR_TAKEN:k`.
# For userspace-only LBR data, use `-e BR_INST_RETIRED.NEAR_TAKEN:u`.
# For system-wide collection, use `-e BR_INST_RETIRED.NEAR_TAKEN -a`.

brya:/data/local/tmp # simpleperf record -b -p <processid> -e BR_INST_RETIRED.NEAR_TAKEN:u -c 10003

# For a standalone binary, use the command below instead.

brya:/data/local/tmp # simpleperf record -b -e BR_INST_RETIRED.NEAR_TAKEN:u -c 10003 ./<binaryname>

# simpleperf record:
#   Profiles processes and stores the profiling data in a file (usually perf.data).
#
# -b:
#   Enables branch recording. It uses the Last Branch Record (LBR) feature of the CPU to capture
#   the most recent branches taken by the processor, which helps in understanding the control
#   flow of a program.
#
# -a:
#   Records system-wide. It collects performance data from all CPUs, not just the one
#   where the command is run, which gives a comprehensive view of system performance.
#
# -e:
#   Specifies the event to record (BR_INST_RETIRED.NEAR_TAKEN in this case).
#
# -c:
#   Sets the event count threshold for sampling: one sample is recorded each time the event
#   has occurred that many times (10003 above).

# To reduce file size and the time needed to convert to AutoFDO input files, we recommend
# converting LBR data into an intermediate branch-list format.

brya:/data/local/tmp # simpleperf inject -i perf.data --output branch-list -o branch_list.data
```

Converting LBR data to AutoFDO input files requires reading binaries.
So for both userspace libraries and the kernel, the conversion is done
on the host, with vmlinux and kernel modules available.

1) Convert LBR data for userspace libraries:

``` sh
# Inject LBR data on the host. It writes output to perf_inject.data.
# perf_inject.data is a text file, containing branch counts for each library.
# Host simpleperf is in <aosp-top>/aosp/out/host/linux-x86/bin/simpleperf,
# or you can build simpleperf by `make simpleperf_ndk`.

host $ adb pull /data/local/tmp/branch_list.data

host $ simpleperf inject -i branch_list.data --binary <binaryorlibraryname> --symdir <aosp-top>/aosp/out/target/product/generic_x86_64/symbols/system/ -o perf_inject.data
```

2) Convert LBR data for userspace and kernel:

``` sh
# Pull LBR data to host.

host $ adb pull /data/local/tmp/branch_list.data

# Download vmlinux and kernel modules to <binary_dir>.
# Host simpleperf is in <aosp-top>/aosp/out/host/linux-x86/bin/simpleperf,
# or you can build simpleperf by `make simpleperf_ndk`.

host $ simpleperf inject -i branch_list.data --binary <userspacebinaryorlibrary> --symdir <symboldir> -o perf_inject.data
```

The generated perf_inject.data may contain branch info for multiple
binaries, but AutoFDO only accepts one at a time. So we need to split
perf_inject.data. The format of perf_inject.data is below:

```perf_inject.data format

executed range with count info for binary1
branch with count info for binary1
// name for binary1

executed range with count info for binary2
branch with count info for binary2
// name for binary2

...
```

We need to split perf_inject.data, and make sure each file only contains info for one binary.

Then we can use [AutoFDO](https://github.com/google/autofdo) to create the profile. Follow README.md
in AutoFDO to build create_llvm_prof, then use `create_llvm_prof` to create profiles for clang.

```sh
# perf_inject_binary1.data is split from perf_inject.data, and only contains branch info for binary1.
host $ create_llvm_prof -profile perf_inject_binary1.data -profiler text -binary path_of_binary1 -out a.afdo -format extbinary

# perf_inject_kernel.data is split from perf_inject.data, and only contains branch info for [kernel.kallsyms].
host $ create_llvm_prof -profile perf_inject_kernel.data -profiler text -binary vmlinux -out a.afdo -format extbinary
```

Then we can use a.afdo for AFDO during compilation, via
`-fprofile-sample-use=a.afdo`.
[Here](https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers)
are more details.

### A complete example: autofdo_inline_test.cpp

`autofdo_inline_test.cpp` is an example to show the complete
process. The source code is in
[autofdo_inline_test.cpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/autofdo_inline_test.cpp).
The build script is in
[Android.bp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/Android.bp).
It builds an executable called `autofdo_inline_test`, which runs on the device
(referred to here as brya).

**Step 1: Build `autofdo_inline_test` binary**

``` sh
(host) <AOSP>$ source build/envsetup.sh
(host) <AOSP>$ lunch aosp_x86_64-trunk_staging-userdebug
(host) <AOSP>$ make autofdo_inline_test
```

**Step 2: Run `autofdo_inline_test` on brya, and collect LBR
data while it runs**

``` sh
(host) <AOSP>$ adb push out/target/product/generic_x86_64/system/bin/autofdo_inline_test /data/local/tmp
(host) <AOSP>$ adb root
(host) <AOSP>$ adb shell
(brya) / $ cd /data/local/tmp
(brya) /data/local/tmp $ chmod a+x autofdo_inline_test
(brya) /data/local/tmp $ simpleperf record -b -p <processidofautofdobinary> -e BR_INST_RETIRED.NEAR_TAKEN:u
simpleperf I cmd_record.cpp:840] Recorded for 4.0012 seconds. Start post processing.
simpleperf I cmd_record.cpp:941] Samples recorded: 7. Samples lost: 0.
(brya) /data/local/tmp $ simpleperf inject --output branch-list -o branch_list.data
(brya) /data/local/tmp $ simpleperf inject -i branch_list.data
(brya) /data/local/tmp $ exit
(host) <AOSP>$ adb pull /data/local/tmp/perf_inject.data
```

**Step 3: Convert LBR data to AutoFDO profile**

``` sh
# Build simpleperf tool on host.
(host) <AOSP>$ make simpleperf_ndk
(host) <AOSP>$ cat perf_inject.data
2
4160-418d:8
419d-41bb:9
3
4170:1
419d:1
41a7:1
4
4159->4187:3
4185->41d2:1
418d->419d:9
41bb->4160:11
// build_id: 0x1631385c6a846e19fd38cec137041c2200000000
// /data/local/tmp/latest/autofdo_inline_test

(host) <AOSP>$ create_llvm_prof --binary <AOSP>/out/target/product/generic_x86_64/system/bin/autofdo_inline_test --format extbinary --out autofdo_inline_test.afdo --profile perf_inject.data --profiler text

(host) <AOSP>$ ls -lh autofdo_inline_test.afdo
-rw-rw-rw- 1 root root 1.0K 2025-03-11 10:18 autofdo_inline_test.afdo
```

**Step 4: Use AutoFDO profile to build optimized binary**

``` sh
(host) <AOSP>$ cp autofdo_inline_test.afdo toolchain/pgo-profiles/sampling/
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/Android.bp
# Edit Android.bp to add an fdo_profile module:
#
# fdo_profile {
#    name: "autofdo_inline_test",
#    profile: "autofdo_inline_test.afdo"
# }
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/afdo_profiles.mk
# Edit afdo_profiles.mk to add the autofdo_inline_test profile mapping:
#
# AFDO_PROFILES += keystore2://toolchain/pgo-profiles/sampling:keystore2 \
#    ...
#    server_configurable_flags://toolchain/pgo-profiles/sampling:server_configurable_flags \
#    autofdo_inline_test://toolchain/pgo-profiles/sampling:autofdo_inline_test
#

(host) <AOSP>$ make autofdo_inline_test
```

We can check whether `autofdo_inline_test.afdo` is used when building the `autofdo_inline_test` binary.

``` sh
(host) <AOSP>$ gzip -d out/verbose.log.gz
(host) <AOSP>$ cat out/verbose.log | grep autofdo_inline_test
   ... -fprofile-sample-use=toolchain/pgo-profiles/sampling/autofdo_inline_test.afdo ...
```

Comparing the disassembly of
`out/target/product/generic_x86_64/system/bin/autofdo_inline_test` before and after
optimizing with AutoFDO data shows different preferences in
inlining, branching, and basic block re-ordering. In addition, we can
monitor Intel(R) PMU branch monitoring events using simpleperf. The tables below
compare event counts without and with AFDO.

|Intel(R) PerfMon EventName|Without AFDO|With AFDO|% Delta|
|-|-|-|-|
|BR_INST_RETIRED.ALL_BRANCHES|25,289,601|25,680,449|2%|
|BR_MISP_RETIRED.ALL_BRANCHES|2,693,141|2,376,465|-12%|
|BR_MISP_RETIRED.COND|2,477,232|2,133,468|-14%|
|BR_MISP_RETIRED.COND_TAKEN|2,136,117|1,897,894|-11%|
|BR_MISP_RETIRED.INDIRECT|238,063|200,008|-16%|
|BR_MISP_RETIRED.INDIRECT_CALL|205,970|179,661|-13%|
|BR_MISP_RETIRED.RET|76,709|72,147|-6%|
|BACLEARS.ANY|6,217,138|5,761,070|-7%|

|Standard Events|Without AFDO|With AFDO|% Delta|
|-|-|-|-|
|cpu-cycles|780,810,870|743,257,553|-5%|
|context-switches|7,463|6,659|-11%|
|task-clock (ms)|187128.967|174391.7821|-7%|

## Related docs

- [Last Branch Record
  Stack](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)
- [Performance monitoring events supported by Intel Performance
  Monitoring Units (PMUs)](https://perfmon-events.intel.com/)
- [AutoFDO tool for converting profile
  data](https://github.com/google/autofdo)