| Name | Date | Size | #Lines | LOC |
| --- | --- | --- | --- | --- |
| BUILD | 03-May-2024 | 1.5 KiB | 67 | 58 |
| CMakeLists.txt | 03-May-2024 | 1.2 KiB | 73 | 65 |
| README.md | 03-May-2024 | 5.8 KiB | 150 | 115 |
| instrumentation.cc | 03-May-2024 | 3.5 KiB | 133 | 100 |
| instrumentation.h | 03-May-2024 | 6.5 KiB | 204 | 102 |
| profiler.cc | 03-May-2024 | 2.9 KiB | 110 | 81 |
| profiler.h | 03-May-2024 | 3.2 KiB | 107 | 49 |
| test.cc | 03-May-2024 | 5.7 KiB | 168 | 121 |
| test_instrumented_library.cc | 03-May-2024 | 1.8 KiB | 60 | 38 |
| test_instrumented_library.h | 03-May-2024 | 905 B | 24 | 5 |
| treeview.cc | 03-May-2024 | 8 KiB | 253 | 201 |
| treeview.h | 03-May-2024 | 5 KiB | 131 | 51 |
README.md

# A minimalistic profiler sampling pseudo-stacks

## Overview

The present directory is the "ruy profiler". As a time profiler, it measures
where code is spending time.

Contrary to most typical profilers, what it samples is not real call stacks, but
"pseudo-stacks", which are just simple data structures constructed from within
the program being profiled. Using this profiler requires manually instrumenting
code to construct such pseudo-stack information.

Another unusual characteristic of this profiler is that it uses only the C++11
standard library. It does not use any non-portable feature; in particular, it
does not rely on signal handlers. The sampling is performed by a thread, the
"profiler thread".
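
To illustrate how sampling from a plain C++11 thread can work without signal
handlers, here is a minimal sketch. It is not ruy's implementation:
`g_current_label` and `SamplingThread` are hypothetical names, and the sketch
samples a single atomic label pointer rather than a full pseudo-stack.

```c++
#include <atomic>
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <thread>

// Hypothetical: the label the instrumented code is currently running under.
// A real pseudo-stack would hold several nested labels.
std::atomic<const char*> g_current_label{nullptr};

// Hypothetical sampler: a thread that periodically reads the current label
// and counts how often each label was observed.
class SamplingThread {
 public:
  explicit SamplingThread(std::chrono::milliseconds period)
      : period_(period), thread_([this] { Run(); }) {}
  ~SamplingThread() { Stop(); }

  // Asks the sampling loop to exit and waits for it. Safe to call twice.
  void Stop() {
    done_.store(true);
    if (thread_.joinable()) thread_.join();
  }

  // Only valid after Stop(): join() orders the sampler's writes before reads.
  const std::map<std::string, int>& counts() const {
    assert(!thread_.joinable());
    return counts_;
  }

 private:
  void Run() {
    do {  // Sample first, so that even a very short run records one sample.
      if (const char* label = g_current_label.load()) ++counts_[label];
      std::this_thread::sleep_for(period_);
    } while (!done_.load());
  }

  const std::chrono::milliseconds period_;
  std::atomic<bool> done_{false};
  std::map<std::string, int> counts_;
  std::thread thread_;  // Declared last: starts after the other members exist.
};
```

In this sketch, the instrumented code would store a label into
`g_current_label` while it runs, and the per-label counts approximate the share
of time spent under each label.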

A discussion of pros/cons of this approach is appended below.

## How to use this profiler

### How to instrument code

An example of instrumented code is given in `test_instrumented_library.cc`.

Code is instrumented by constructing `ScopeLabel` objects. These are RAII
helpers, ensuring that the thread pseudo-stack contains the label during their
lifetime. In the most common use case, one would construct such an object at the
start of a function, so that its scope is the function scope and it measures
how much time is spent in this function.

```c++
#include "ruy/profiler/instrumentation.h"

...

void SomeFunction() {
  ruy::profiler::ScopeLabel function_label("SomeFunction");
  ... do something ...
}
```

A `ScopeLabel` may however have any scope, for instance:

```c++
if (some_case) {
  ruy::profiler::ScopeLabel extra_work_label("Some more work");
  ... do some more work ...
}
```
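
The RAII behavior described above can be sketched in a self-contained way as
follows. This is only an illustration under stated assumptions, not the code in
`instrumentation.h`: `ScopedLabel`, `ThisThreadPseudoStack` and
`DepthInNestedScope` are hypothetical names, and this simplified stack is not
safe to read from another thread, which a real sampling profiler would need.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread pseudo-stack: the labels of all live scopes,
// outermost first. Each thread gets its own copy.
inline std::vector<const char*>& ThisThreadPseudoStack() {
  thread_local std::vector<const char*> stack;
  return stack;
}

// Hypothetical RAII helper: pushes a label on construction and pops it on
// destruction, so the label is on the pseudo-stack exactly while in scope.
class ScopedLabel {
 public:
  explicit ScopedLabel(const char* label) {
    ThisThreadPseudoStack().push_back(label);
  }
  ~ScopedLabel() {
    assert(!ThisThreadPseudoStack().empty());
    ThisThreadPseudoStack().pop_back();
  }
  ScopedLabel(const ScopedLabel&) = delete;
  ScopedLabel& operator=(const ScopedLabel&) = delete;
};

// Returns the pseudo-stack depth observed inside a nested, conditional scope.
inline std::size_t DepthInNestedScope(bool some_case) {
  ScopedLabel function_label("SomeFunction");
  if (some_case) {
    ScopedLabel extra_work_label("Some more work");
    return ThisThreadPseudoStack().size();  // Both labels are live here.
  }
  return ThisThreadPseudoStack().size();  // Only the function label is live.
}
```

Because the labels are popped by destructors, the pseudo-stack is empty again
once `DepthInNestedScope` returns, including on early returns.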

The string passed to the `ScopeLabel` constructor must be just a pointer to a
literal string (a `char*` pointer). The profiler will assume that these pointers
stay valid until the profile is finalized.

However, that literal string may be a `printf` format string, and labels may
have up to 4 parameters, of type `int`. For example:

```c++
void SomeFunction(int size) {
  ruy::profiler::ScopeLabel function_label("SomeFunction (size=%d)", size);
  ... do something ...
}
```
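
One plausible way to support such labels cheaply is to store only the format
pointer and the integer parameters when the label is created, and to expand the
text later, when the profile is reported. The sketch below illustrates that
idea; `FormattedLabel`, `MakeLabel` and `FormatLabel` are hypothetical names,
and whether ruy defers formatting this way is an assumption.

```c++
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical label record: a pointer to a literal format string plus up to
// four int parameters. Nothing is formatted when the label is created.
struct FormattedLabel {
  static constexpr int kMaxArgs = 4;
  const char* format = nullptr;
  int args[kMaxArgs] = {0, 0, 0, 0};
};

template <typename... Ints>
FormattedLabel MakeLabel(const char* format, Ints... values) {
  constexpr int kCount = static_cast<int>(sizeof...(Ints));
  static_assert(kCount <= FormattedLabel::kMaxArgs,
                "labels take at most 4 int parameters");
  assert(format != nullptr);
  FormattedLabel label;
  label.format = format;
  // The leading 0 keeps the array non-empty when no parameters are given.
  const int packed[] = {0, static_cast<int>(values)...};
  for (int i = 0; i < kCount; ++i) {
    label.args[i] = packed[i + 1];
  }
  return label;
}

// Expands the label into text. Always passing all four values is fine:
// printf-style functions ignore arguments beyond those the format consumes.
inline std::string FormatLabel(const FormattedLabel& label) {
  char buffer[256];
  std::snprintf(buffer, sizeof(buffer), label.format, label.args[0],
                label.args[1], label.args[2], label.args[3]);
  return std::string(buffer);
}
```

Storing the pointer rather than a copy of the string is also why the format
must stay valid until the profile is finalized.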

### How to run the profiler

Profiling instrumentation is a no-op unless the preprocessor token
`RUY_PROFILER` is defined, so defining it is the first step when actually
profiling. When building with Bazel, the preferred way to enable that is to pass
this flag on the Bazel command line:

```
--define=ruy_profiler=true
```
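
The "no-op unless a token is defined" pattern can be sketched as two classes
with the same interface, selected by the preprocessor. This is a generic
illustration, not the contents of `instrumentation.h`: `MY_PROFILER` is a
made-up token standing in for `RUY_PROFILER`, and `ScopedLabel`, `LabelDepth`
and `DepthInsideInstrumentedScope` are hypothetical names.

```c++
#include <cassert>

// Hypothetical depth counter standing in for a real per-thread pseudo-stack.
inline int& LabelDepth() {
  thread_local int depth = 0;
  return depth;
}

// MY_PROFILER is a made-up token playing the role that RUY_PROFILER plays in
// the text above.
#ifdef MY_PROFILER
constexpr bool kProfilerEnabled = true;

class ScopedLabel {
 public:
  template <typename... Args>
  explicit ScopedLabel(const char* /*format*/, Args... /*args*/) {
    ++LabelDepth();
  }
  ~ScopedLabel() {
    assert(LabelDepth() > 0);
    --LabelDepth();
  }
};
#else
constexpr bool kProfilerEnabled = false;

// Same interface, empty bodies: call sites compile unchanged and the
// optimizer can drop them entirely.
class ScopedLabel {
 public:
  template <typename... Args>
  explicit ScopedLabel(const char* /*format*/, Args... /*args*/) {}
  ~ScopedLabel() {}
};
#endif

// An instrumented function that reports the depth seen inside its scope.
inline int DepthInsideInstrumentedScope(int size) {
  ScopedLabel label("SomeFunction (size=%d)", size);
  return LabelDepth();
}
```

With GCC or Clang, compiling this sketch with `-DMY_PROFILER` selects the
instrumented variant; without it, the labels cost nothing at run time.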

To actually profile a code scope, it is enough to construct a `ScopeProfile`
object, also a RAII helper. It will start the profiler on construction, and on
destruction it will terminate the profiler and report the profile treeview on
standard output by default. Example:

```c++
void SomeProfiledBenchmark() {
  ruy::profiler::ScopeProfile profile;

  CallSomeInstrumentedCode();
}
```

An example is provided by the `:test` target in the present directory. Run it
with `--define=ruy_profiler=true` as explained above:

```
bazel run -c opt \
  --define=ruy_profiler=true \
  //tensorflow/lite/experimental/ruy/profiler:test
```

The default behavior of dumping the treeview on standard output may be
overridden by passing a pointer to a `TreeView` object to the `ScopeProfile`
constructor. This causes the treeview to be stored in that `TreeView` object,
where it may be accessed and manipulated using the functions declared in
`treeview.h`. The aforementioned `:test` provides examples for doing so.
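
The "print by default, or store into a caller-provided object" behavior can be
sketched generically as below. This does not use or reproduce the actual
`ScopeProfile` or `TreeView` API: `Report`, `ScopedReport` and `CaptureReport`
are hypothetical stand-ins, and the report content is a placeholder.

```c++
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical result type standing in for a treeview.
struct Report {
  std::string text;
};

// Hypothetical RAII scope: on destruction it either prints the report to
// standard output (default) or stores it in a caller-provided object.
class ScopedReport {
 public:
  explicit ScopedReport(Report* user_report = nullptr)
      : user_report_(user_report) {}
  ~ScopedReport() {
    Report report;
    report.text = "profile of the enclosing scope";  // Placeholder content.
    if (user_report_ != nullptr) {
      *user_report_ = report;
    } else {
      std::cout << report.text << std::endl;
    }
  }
  ScopedReport(const ScopedReport&) = delete;
  ScopedReport& operator=(const ScopedReport&) = delete;

 private:
  Report* const user_report_;
};

// Runs a scope and returns what was captured instead of printed.
inline std::string CaptureReport() {
  Report captured;
  {
    ScopedReport scope(&captured);
    // ... instrumented code would run here ...
  }
  assert(!captured.text.empty());  // Filled in when `scope` was destroyed.
  return captured.text;
}
```

Capturing the result in an object rather than printing it is what makes the
profile usable from tests and from code that post-processes it.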

## Advantages and drawbacks

Compared to a traditional profiler, e.g. Linux's "perf", the present kind of
profiler has the following drawbacks:

*   Requires manual instrumentation of code being profiled.
*   Substantial overhead, modifying the performance characteristics of the code
    being measured.
*   Questionable accuracy.

But also the following advantages:

*   Profiling can be driven from within a benchmark program, allowing the entire
    profiling procedure to be a single command line.
*   Not relying on symbol information removes exposure to toolchain
    details and means less hassle in some build environments, especially
    embedded/mobile (single command line to run and profile, no symbol files
    required).
*   Fully portable (all of this is standard C++11).
*   Fully testable (see `:test`). Profiling becomes just another feature of the
    code like any other.
*   Customized instrumentation can result in easier-to-read treeviews (only
    relevant functions, and custom labels may be more readable than function
    names).
*   Parametrized/formatted labels make it possible to do things that aren't
    possible with call-stack-sampling profilers. For example, break down a
    profile where much time is being spent in matrix multiplications, by the
    various matrix multiplication shapes involved.
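
As an illustration of the last point, the sketch below aggregates sampled
pseudo-stacks into a tree in which each matrix multiplication shape gets its own
node. It is a simplified model under stated assumptions, not the data structure
in `treeview.h`: `Node`, `AddSample` and `MatMulLabel` are hypothetical names,
and the label text is an invented example.

```c++
#include <cassert>
#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical tree node: how many samples passed through this label, and
// the children keyed by their (already formatted) label text.
struct Node {
  int weight = 0;
  std::map<std::string, std::unique_ptr<Node>> children;
};

// Adds one sampled pseudo-stack (outermost label first) to the tree.
inline void AddSample(Node* root, const std::vector<std::string>& stack) {
  assert(root != nullptr);
  Node* node = root;
  ++node->weight;
  for (const std::string& label : stack) {
    std::unique_ptr<Node>& child = node->children[label];
    if (!child) child.reset(new Node);
    node = child.get();
    ++node->weight;
  }
}

// Formats a matrix multiplication label from its shape, the way a
// parametrized label would.
inline std::string MatMulLabel(int rows, int depth, int cols) {
  char buffer[96];
  std::snprintf(buffer, sizeof(buffer), "MatMul (rows=%d, depth=%d, cols=%d)",
                rows, depth, cols);
  return std::string(buffer);
}
```

Feeding this tree samples such as `{"Benchmark", MatMulLabel(64, 64, 64)}`
yields one child per distinct shape under `Benchmark`, each weighted by the
number of samples taken while that shape was being multiplied.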

The philosophy underlying this profiler is that software performance depends on
software engineers profiling often, and a key factor limiting that in practice
is the difficulty or cumbersome aspects of profiling with more serious profilers
such as Linux's "perf", especially in embedded/mobile development: multiple
command lines are involved to copy symbol files to devices, retrieve profile
data from the device, etc. In that context, it is useful to make profiling as
easy as benchmarking, even on embedded targets, even if the price to pay for
that is lower accuracy, higher overhead, and some intrusive instrumentation
requirement.

Another key aspect determining what profiling approach is suitable for a given
context is whether one already has a priori knowledge of where much of the time
is likely being spent. When one has such a priori knowledge, it is feasible to
instrument the known possibly-critical code as per the present approach. On the
other hand, in situations where one doesn't have such a priori knowledge, a real
profiler such as Linux's "perf" gives, right away, a profile of real stacks,
from just symbol information generated by the toolchain.