# Collect LBR (x86 Architectures) data for AutoFDO

# Table of Contents

-   Introduction
-   AFDO Compiler Optimizations
    -   Sampling Profiler
    -   Execution Profiles
    -   Limitations for Code Coverage
    -   Generating Sampling Profiles
    -   AFDO Flow Diagram
-   Intel's Performance Monitoring Unit (PMU)
-   Examples
    -   A complete example: autofdo_inline_test.cpp
-   Related docs

## Introduction

This user guide provides an overview of AutoFDO (AFDO) compiler
optimizations, details on Intel Performance Monitoring Units (PMU), and
instructions for collecting Last Branch Record (LBR) related profiles on
x86 platforms.

## AFDO Compiler Optimizations

**AutoFDO compiler optimizations** are a set of techniques that
compilers use to improve the performance of software applications. These
optimizations are driven by profiles sampled from hardware performance
events such as `br_inst_retired.near_taken` and `cpu-cycles`.

### Sampling Profiler

A sampling profiler can generate a performance profile with very low
runtime overhead. This profile is crucial for optimization purposes but
is not suitable for code coverage analysis. The profiler collects data
by periodically sampling the program's execution, which provides a
statistical representation of where time is being spent in the code.

### Execution Profiles

Compilers utilize **execution profiles** that consist of basic block and
edge frequency counts. These profiles guide various optimizations,
including:

-   **Instruction Scheduling**: Reordering instructions to minimize
    delays and improve pipeline efficiency.
-   **Basic Block Re-ordering**: Rearranging basic blocks to enhance
    cache performance and reduce branch mispredictions.
-   **Function Splitting**: Moving rarely executed parts of a function
    away from its hot path to improve instruction cache locality.
-   **Register Allocation**: Efficiently assigning variables to CPU
    registers to minimize memory access.

These optimizations aim to improve execution speed, reduce resource
consumption, and enhance the overall efficiency of applications on
specific hardware configurations.

### Limitations for Code Coverage

While it is technically possible to use sampling profiles for code
coverage, they are generally too coarse-grained for this purpose.
Sampling profiles provide a statistical view rather than a precise
execution trace, leading to poor results in code coverage analysis.

### Generating Sampling Profiles

Sampling profiles must be generated by an external tool; this guide
uses simpleperf. Once generated, the profile needs to be converted into
a format that LLVM can read, using the `create_llvm_prof` tool.

### AFDO Flow Diagram

![AFDOFlow Image](./AFDOFlow.png)

## Intel's Performance Monitoring Unit (PMU)

Intel's Performance Monitoring Unit (PMU) is a hardware feature built
into Intel processors to measure various performance parameters. These
parameters include instruction cycles, cache hits, cache misses, branch
misses, and more. The PMU helps in understanding how effectively code
uses hardware resources and provides insights for optimization. The Last
Branch Record (LBR) is one of the performance monitoring features that
the PMU provides.

The Last Branch Record (LBR) is a CPU feature that logs the source and
destination addresses of recently executed branch instructions. This
makes it a useful tool for performance monitoring and debugging,
allowing developers to track the control flow of their programs. By
analyzing the data captured through LBR, we can see how applications
move through their execution paths and pinpoint the areas where the
program spends most of its time, often referred to as "hot paths."

**Branch Statistics**: LBR can be used to gather branch statistics in
C++ programs. This data helps in understanding the behavior of
conditional decisions in the code.

**Virtual Calls**: LBR is particularly useful for analyzing the
outcomes of indirect branches and virtual calls, key components in
object-oriented programming that can significantly influence
performance.

Each LBR entry typically consists of a `FROM_IP` and a `TO_IP`, which
denote the source and destination addresses of a branch instruction.
This logging offers a clear view of the program's execution flow.

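To make the `FROM_IP`/`TO_IP` pairs concrete: between the target of one
recorded branch and the source of the next recorded branch, no branch
was taken, so the code in that address range ran sequentially. The
sketch below assumes a hypothetical plain-text dump with one
`FROM_IP TO_IP` pair per line, oldest entry first (this is not a real
simpleperf or hardware format), and assumes every taken branch was
recorded. The function name `lbr_ranges` is illustrative.

```sh
# lbr_ranges: read "FROM_IP TO_IP" pairs (hypothetical format, oldest
# first) from stdin and print the address ranges executed in between.
lbr_ranges() {
  awk '
    NF == 2 {
      # Code from the previous branch target up to this branch source
      # ran without a taken branch in between.
      if (have_prev) print prev_to "-" $1
      prev_to = $2
      have_prev = 1
    }
  '
}

# Example: a loop that branches 41bb -> 4160, 418d -> 419d, 41bb -> 4160.
printf '%s\n' '41bb 4160' '418d 419d' '41bb 4160' | lbr_ranges
```

For this input the sketch prints `4160-418d` and `419d-41bb`, the same
kind of executed-range records that appear in the perf_inject.data dump
in the complete example later in this guide.
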
**Model Specific Registers (MSRs)**: LBR is configured through Model
Specific Registers (MSRs) specific to Intel CPUs. These registers enable
and manage the LBR functionality:

-   **IA32_DEBUGCTL**: To start LBR recording, set bit 0 of the
    IA32_DEBUGCTL register to 1.
-   **MSR_LASTBRANCH_x\_FROM_IP**: Stores the source addresses of the
    most recent branch instructions.
-   **MSR_LASTBRANCH_x\_TO_IP**: Stores the destination addresses of
    those most recent branches.

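As a small illustration of the bit-0 rule above, the sketch below tests
whether the LBR enable bit is set in an IA32_DEBUGCTL value. How the
value is obtained is deliberately left out: reading MSRs needs
privileged access, and when recording with `simpleperf record -b` the
kernel is normally the one programming these registers. The helper name
`lbr_bit_set` is hypothetical and only meant to show the register
layout.

```sh
# lbr_bit_set VALUE
# Report whether bit 0 (LBR enable) is set in an IA32_DEBUGCTL value.
# VALUE may be decimal or 0x-prefixed hex.
lbr_bit_set() {
  if [ $(( $1 & 1 )) -eq 1 ]; then
    echo "LBR enabled"
  else
    echo "LBR disabled"
  fi
}

lbr_bit_set 0x1      # bit 0 set
lbr_bit_set 0x4000   # bit 0 clear
```
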
**Clearing LBRs**: LBR contents are cleared when the CPU enters
low-power sleep states deeper than C2. To preserve the recorded data,
it may be necessary to keep the CPU awake.

**Stopping LBR**: Stopping LBR recording can be difficult and might
require performance monitoring interrupts (PMIs), which adds
complexity to managing this feature.

**Advantages**:

-   **Overhead**: LBR has minimal overhead; it causes nearly zero
    performance degradation compared to traditional software-based
    branch recording methods, making it an efficient choice for
    performance-sensitive applications.
-   **Accuracy**: Manual code instrumentation might yield slightly
    better precision in certain scenarios, but at the cost of
    significantly higher runtime overhead, which makes LBR the more
    appealing alternative in many cases.
-   **Scenarios**: LBR is particularly useful when the source code is
    not readily accessible or when the software build process is
    unknown. In such cases, the recorded execution paths still give
    developers and analysts insight into program behavior.

Simpleperf supports collecting LBR data and converting it to input files
for AutoFDO, which can then be used for Feedback Directed Optimization
during compilation.

## Examples

Below are examples of collecting LBR data for AutoFDO. The process has
two steps: first record LBR data, then convert the LBR data to AutoFDO
input files.

Record LBR data:

``` sh
# Preparation: we need root access on the device to record LBR data.
# Initial setup:
$ adb root
$ adb remount
# The device asks for a reboot for the changes to be applied.
# Once the initial setup is done, only the steps below are needed from then on.
$ adb root
$ adb shell
brya:/ # cd data/local/tmp
brya:/data/local/tmp #

# Record LBR data. The output is written to perf.data.
# For kernel-only LBR data, use `-e BR_INST_RETIRED.NEAR_TAKEN:k`.
# For userspace-only LBR data, use `-e BR_INST_RETIRED.NEAR_TAKEN:u`.
# For a system wide collection, use `-e BR_INST_RETIRED.NEAR_TAKEN -a`.

brya:/data/local/tmp # simpleperf record -b -p <processid> -e BR_INST_RETIRED.NEAR_TAKEN:u -c 10003

# For a standalone binary, use the command below instead.

brya:/data/local/tmp # simpleperf record -b -e BR_INST_RETIRED.NEAR_TAKEN:u -c 10003 ./<binaryname>

# Options used above:
#   simpleperf record: profiles processes and stores the profiling data
#       in a file (usually perf.data).
#   -b: enables branch recording. It uses the Last Branch Record (LBR)
#       feature of the CPU to capture the most recent branches taken by
#       the processor, which is useful for understanding control flow.
#   -a: records system-wide, collecting data from all CPUs instead of
#       from a single process.
#   -e: specifies the event to record (BR_INST_RETIRED.NEAR_TAKEN here).
#   -c: sets the sample period, i.e. the number of event occurrences
#       between two samples.

# To reduce file size and time converting to AutoFDO input files, we recommend converting LBR data into an intermediate branch-list format.

brya:/data/local/tmp # simpleperf inject -i perf.data --output branch-list -o branch_list.data
```

Converting LBR data to AutoFDO input files requires reading the
binaries. So for userspace and kernel libraries, the conversion needs
to be done on the host, with vmlinux and kernel modules available.

1) Convert LBR data for userspace libraries:

``` sh
# Inject LBR data on the host. It writes output to perf_inject.data.
# perf_inject.data is a text file, containing branch counts for each library.
# Host simpleperf is in <aosp-top>/aosp/out/host/linux-x86/bin/simpleperf,
# or you can build simpleperf by `make simpleperf_ndk`.

host $ adb pull /data/local/tmp/branch_list.data

host $ simpleperf inject -i branch_list.data --binary <binaryorlibraryname> --symdir <aosp-top>/aosp/out/target/product/generic_x86_64/symbols/system/ -o perf_inject.data
```

2) Convert LBR data for userspace and kernel:

``` sh
# Pull LBR data to host.

host $ adb pull /data/local/tmp/branch_list.data

# Download vmlinux and kernel modules to <binary_dir>.
# Host simpleperf is in <aosp-top>/aosp/out/host/linux-x86/bin/simpleperf,
# or you can build simpleperf by `make simpleperf_ndk`.

host $ simpleperf inject -i branch_list.data --binary <userspacebinaryorlibrary> --symdir <symboldir> -o perf_inject.data
```

The generated perf_inject.data may contain branch info for multiple
binaries, but AutoFDO only accepts one at a time. So we need to split
perf_inject.data. The format of perf_inject.data is below:

```perf_inject.data format

executed range with count info for binary1
branch with count info for binary1
// name for binary1

executed range with count info for binary2
branch with count info for binary2
// name for binary2

...
```

We need to split perf_inject.data, and make sure one file only contains info for one binary.

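A minimal sketch of the split, assuming each binary's block ends with a
`// <name>` line as in the format above, optionally preceded by a
`// build_id: ...` line as in the dump shown in the complete example
later in this guide. The function name `split_perf_inject` and the
`perf_inject_<name>.data` output naming are illustrative choices, not
part of simpleperf or AutoFDO.

```sh
# split_perf_inject INPUT OUTDIR
# Write one file per binary found in INPUT (a perf_inject.data file).
split_perf_inject() {
  mkdir -p "$2" || return 1
  awk -v outdir="$2" '
    NF == 0 { next }                       # skip blank separator lines
    { block = block $0 "\n" }              # collect the current block
    /^\/\/ / && !/^\/\/ build_id:/ {       # a "// <name>" line closes a block
      name = $0
      sub(/^\/\/ /, "", name)              # drop the "// " prefix
      sub(/^.*\//, "", name)               # keep the file name only
      gsub(/[^A-Za-z0-9._-]/, "", name)    # drop brackets, spaces, etc.
      out = outdir "/perf_inject_" name ".data"
      printf "%s", block > out
      close(out)
      block = ""
    }
  ' "$1"
}
```

For example, `split_perf_inject perf_inject.data split` would write
files such as `split/perf_inject_kernel.kallsyms.data`, each holding the
records for one binary. Binaries that share a file name overwrite each
other in this sketch, so a real script may want to include the build id
in the output name.
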
Then we can use [AutoFDO](https://github.com/google/autofdo) to create
the profile. Follow README.md in AutoFDO to build `create_llvm_prof`,
then use `create_llvm_prof` to create profiles for clang.

``` sh
# perf_inject_binary1.data is split from perf_inject.data, and only contains branch info for binary1.
host $ create_llvm_prof -profile perf_inject_binary1.data -profiler text -binary path_of_binary1 -out a.afdo -format extbinary

# perf_inject_kernel.data is split from perf_inject.data, and only contains branch info for [kernel.kallsyms].
host $ create_llvm_prof -profile perf_inject_kernel.data -profiler text -binary vmlinux -out a.afdo -format extbinary
```

Then we can use a.afdo for AFDO during compilation, via
`-fprofile-sample-use=a.afdo`.
[Here](https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers)
are more details.

### A complete example: autofdo_inline_test.cpp

`autofdo_inline_test.cpp` is an example to show the complete
process. The source code is in
[autofdo_inline_test.cpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/autofdo_inline_test.cpp).
The build script is in
[Android.bp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/Android.bp).
It builds an executable called `autofdo_inline_test`, which runs on the
device (referred to here as brya).

**Step 1: Build `autofdo_inline_test` binary**

``` sh
(host) <AOSP>$ source build/envsetup.sh
(host) <AOSP>$ lunch aosp_x86_64-trunk_staging-userdebug
(host) <AOSP>$ make autofdo_inline_test
```

**Step 2: Run `autofdo_inline_test` on brya, and collect LBR data
while it runs**

``` sh
(host) <AOSP>$ adb push out/target/product/generic_x86_64/system/bin/autofdo_inline_test /data/local/tmp
(host) <AOSP>$ adb root
(host) <AOSP>$ adb shell
(brya) / $ cd /data/local/tmp
(brya) /data/local/tmp $ chmod a+x autofdo_inline_test
# Record while autofdo_inline_test is running; -p takes its process id.
(brya) /data/local/tmp $ simpleperf record -b -p <processidofautofdobinary> -e BR_INST_RETIRED.NEAR_TAKEN:u
simpleperf I cmd_record.cpp:840] Recorded for 4.0012 seconds. Start post processing.
simpleperf I cmd_record.cpp:941] Samples recorded: 7. Samples lost: 0.
# Convert the recorded data to branch-list format, then to perf_inject.data.
(brya) /data/local/tmp $ simpleperf inject --output branch-list -o branch_list.data
(brya) /data/local/tmp $ simpleperf inject -i branch_list.data
(brya) /data/local/tmp $ exit
(host) <AOSP>$ adb pull /data/local/tmp/perf_inject.data
```

**Step 3: Convert LBR data to AutoFDO profile**

``` sh
# Build simpleperf tool on host.
(host) <AOSP>$ make simpleperf_ndk
(host) <AOSP>$ cat perf_inject.data
2
4160-418d:8
419d-41bb:9
3
4170:1
419d:1
41a7:1
4
4159->4187:3
4185->41d2:1
418d->419d:9
41bb->4160:11
// build_id: 0x1631385c6a846e19fd38cec137041c2200000000
// /data/local/tmp/latest/autofdo_inline_test

(host) <AOSP>$ create_llvm_prof --binary <AOSP>/out/target/product/generic_x86_64/system/bin/autofdo_inline_test --format extbinary --out autofdo_inline_test.afdo --profile perf_inject.data --profiler text

(host) <AOSP>$ ls -lh autofdo_inline_test.afdo
-rw-rw-rw- 1 root root 1.0K 2025-03-11 10:18 autofdo_inline_test.afdo
```

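To read the dump above: it appears to follow AutoFDO's text profile
layout, where each bare count line (`2`, `3`, `4`) announces how many
records follow: executed ranges (`start-end:count`), then single
addresses (`addr:count`), then branches (`from->to:count`). As a quick
sanity check before running `create_llvm_prof`, the sketch below lists
the branch records by descending count. The function name
`hot_branches` is illustrative.

```sh
# hot_branches FILE [N]
# Print the N most frequent "from->to:count" branch records (default 3).
hot_branches() {
  grep -e '->' "$1" | sort -t: -k2,2nr | head -n "${2:-3}"
}
```

For the dump above, `hot_branches perf_inject.data 1` prints
`41bb->4160:11`, the backward branch that closes the hot loop.
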
**Step 4: Use AutoFDO profile to build optimized binary**

``` sh
(host) <AOSP>$ cp autofdo_inline_test.afdo toolchain/pgo-profiles/sampling/
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/Android.bp
# Edit Android.bp to add an fdo_profile module:
#
# fdo_profile {
#    name: "autofdo_inline_test",
#    profile: "autofdo_inline_test.afdo"
# }
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/afdo_profiles.mk
# Edit afdo_profiles.mk to add autofdo_inline_test profile mapping:
#
# AFDO_PROFILES += keystore2://toolchain/pgo-profiles/sampling:keystore2 \
#  ...
#  server_configurable_flags://toolchain/pgo-profiles/sampling:server_configurable_flags \
#  autofdo_inline_test://toolchain/pgo-profiles/sampling:autofdo_inline_test
#

(host) <AOSP>$ make autofdo_inline_test
```

We can check if `autofdo_inline_test.afdo` is used when building the
`autofdo_inline_test` binary.

``` sh
(host) <AOSP>$ gzip -d out/verbose.log.gz
(host) <AOSP>$ cat out/verbose.log | grep autofdo_inline_test
   ... -fprofile-sample-use=toolchain/pgo-profiles/sampling/autofdo_inline_test.afdo ...
```

Comparing the disassembly of
`out/target/product/generic_x86_64/system/bin/autofdo_inline_test` before and after
optimizing with AutoFDO data, we can see different preferences in
inlining, branching, and basic block re-ordering. In addition, we can
monitor Intel(R) PMU branch monitoring events using simpleperf. The
tables below compare event counts without and with AFDO.

|Intel(R) PerfMon Event Name|Without AFDO|With AFDO|% Delta|
|-|-|-|-|
|BR_INST_RETIRED.ALL_BRANCHES|25,289,601|25,680,449|2%|
|BR_MISP_RETIRED.ALL_BRANCHES|2,693,141|2,376,465|-12%|
|BR_MISP_RETIRED.COND|2,477,232|2,133,468|-14%|
|BR_MISP_RETIRED.COND_TAKEN|2,136,117|1,897,894|-11%|
|BR_MISP_RETIRED.INDIRECT|238,063|200,008|-16%|
|BR_MISP_RETIRED.INDIRECT_CALL|205,970|179,661|-13%|
|BR_MISP_RETIRED.RET|76,709|72,147|-6%|
|BACLEARS.ANY|6,217,138|5,761,070|-7%|

|Standard Events|Without AFDO|With AFDO|% Delta|
|-|-|-|-|
|cpu-cycles|780,810,870|743,257,553|-5%|
|context-switches|7,463|6,659|-11%|
|task-clock (ms)|187,128.967|174,391.7821|-7%|

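The `% Delta` values can be reproduced as the relative change from the
run without AFDO to the run with AFDO, rounded to the nearest percent.
A small sketch for recomputing it from two counter values follows; the
helper name `pct_delta` is illustrative, and thousands separators are
stripped before the arithmetic.

```sh
# pct_delta WITHOUT WITH
# Print the relative change from WITHOUT to WITH as a rounded percentage.
pct_delta() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    gsub(/,/, "", before)   # allow values such as 2,693,141
    gsub(/,/, "", after)
    printf "%+.0f%%\n", (after - before) / before * 100
  }'
}

pct_delta 2,693,141 2,376,465       # BR_MISP_RETIRED.ALL_BRANCHES
pct_delta 780,810,870 743,257,553   # cpu-cycles
```

The two calls print `-12%` and `-5%`, matching the corresponding table
rows.
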
## Related docs

-   [Last Branch Record
    Stack](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)
-   [Performance monitoring events supported by Intel Performance
    Monitoring Units (PMUs)](https://perfmon-events.intel.com/)
-   [AutoFDO tool for converting profile
    data](https://github.com/google/autofdo)