<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

<book id="oprofile-internals">
<bookinfo>
	<title>OProfile Internals</title>

	<authorgroup>
		<author>
			<firstname>John</firstname>
			<surname>Levon</surname>
			<affiliation>
				<address><email>levon@movementarian.org</email></address>
			</affiliation>
		</author>
	</authorgroup>

	<copyright>
		<year>2003</year>
		<holder>John Levon</holder>
	</copyright>
</bookinfo>

<toc></toc>

<chapter id="introduction">
<title>Introduction</title>

<para>
This document is current for OProfile version <oprofileversion />.
It provides some details on the internal workings of OProfile for the
interested hacker, and assumes strong C, working C++, plus some knowledge of
kernel internals and CPU hardware.
</para>
<note>
<para>
Only the "new" implementation associated with kernel 2.6 and above is covered here.
2.4 uses a very different kernel module implementation and daemon to produce the sample files.
</para>
</note>

<sect1 id="overview">
<title>Overview</title>
<para>
OProfile is a statistical continuous profiler. In other words, profiles are generated by
regularly sampling the current registers on each CPU (from an interrupt handler, the
saved PC value at the time of interrupt is stored), and converting that runtime PC
value into something meaningful to the programmer.
</para>
<para>
OProfile achieves this by taking the stream of sampled PC values, along with the detail
of which task was running at the time of the interrupt, and converting it into a file offset
against a particular binary file. Because applications <function>mmap()</function>
the code they run (be it <filename>/bin/bash</filename>, <filename>/lib/libfoo.so</filename>
or whatever), it's possible to find the relevant binary file and offset by walking
the task's list of mapped memory areas. Each PC value is thus converted into a tuple
of binary-image,offset. This is something that the userspace tools can use directly
to reconstruct where the code came from, including the particular assembly instructions,
symbol, and source line (via the binary's debug information if present).
</para>
<para>
Regularly sampling the PC value like this approximates what actually was executed and
how often - more often than not, this statistical approximation is good enough to
reflect reality. In common operation, the time between each sample interrupt is regulated
by a fixed number of clock cycles. This implies that the results will reflect where
the CPU is spending the most time; this is obviously a very useful information source
for performance analysis.
</para>
<para>
Sometimes though, an application programmer needs different kinds of information: for example,
"which of the source routines cause the most cache misses?". The rise in importance of
such metrics in recent years has led many CPU manufacturers to provide hardware performance
counters capable of measuring these events at the hardware level. Typically, these counters
increment once per event, and generate an interrupt on reaching some pre-defined
number of events. OProfile can use these interrupts to generate samples: then, the
profile results are a statistical approximation of which code caused how many occurrences
of the given event.
</para>
<para>
Consider a simplified system that only executes two functions A and B. A
takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at
100 cycles a second, and we've set the performance counter to create an
interrupt after a set number of "events" (in this case an event is one
clock cycle). It should be clear that the chance of the interrupt
occurring in function A is 1/100, and in function B it is 99/100. Thus, we
statistically approximate the actual relative performance features of
the two functions over time. This same analysis works for other types of
events, provided that the interrupt is tied to the number of events
occurring (that is, after N events, an interrupt is generated).
</para>
<para>
There is typically more than one of these counters, so it's possible to set up profiling
for several different event types. Using these counters gives us a powerful, low-overhead
way of gaining performance metrics. If OProfile, or the CPU, does not support performance
counters, then a simpler method is used: the kernel timer interrupt feeds samples
into OProfile itself.
</para>
<para>
The rest of this document concerns itself with how we get from receiving samples at
interrupt time to producing user-readable profile information.
</para>
</sect1>

<sect1 id="components">
<title>Components of the OProfile system</title>

<sect2 id="arch-specific-components">
<title>Architecture-specific components</title>
<para>
If OProfile supports the hardware performance counters found on
a particular architecture, the code for setting up and managing
these counters can be found in the kernel source
tree in the relevant <filename>arch/<emphasis>arch</emphasis>/oprofile/</filename>
directory. The architecture-specific implementation works by
filling in the <varname>oprofile_operations</varname> structure at init time. This
provides a set of operations such as <function>setup()</function>,
<function>start()</function>, <function>stop()</function>, etc.
that manage the hardware-specific details of fiddling with the
performance counter registers.
</para>
<para>
The other important facility available to the architecture code is
<function>oprofile_add_sample()</function>. This is where a particular sample
taken at interrupt time is fed into the generic OProfile driver code.
</para>
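<para>
For illustration, a minimal architecture driver might register itself with the
core along the following lines. This is only a sketch: the exact signature of
<function>oprofile_arch_init()</function> and the field set of
<varname>oprofile_operations</varname> have varied between kernel versions, so
treat the names below as approximate.
</para>
<screen>
/* Sketch of an architecture driver registering with the OProfile core.
 * Field names follow include/linux/oprofile.h, but are approximate. */
static int my_arch_setup(void)  { /* program the counter registers */ return 0; }
static int my_arch_start(void)  { /* enable the counters */ return 0; }
static void my_arch_stop(void)  { /* disable the counters */ }

int oprofile_arch_init(struct oprofile_operations *ops)
{
	ops->setup    = my_arch_setup;
	ops->start    = my_arch_start;
	ops->stop     = my_arch_stop;
	ops->cpu_type = "my_cpu/type";  /* reported to userspace */
	return 0;
}
</screen>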
</sect2>

<sect2 id="filesystem">
<title>oprofilefs</title>
<para>
OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from
userspace at <filename>/dev/oprofile</filename>. This consists of small
files for reporting configuration to, and receiving configuration from,
userspace, as well as the actual character device from which the OProfile
userspace daemon receives samples. At <function>setup()</function> time,
the architecture-specific code may
add further configuration files related to the details of the performance
counters. For example, on x86, one numbered directory for each hardware
performance counter is added, with files in each for the event type,
reset value, etc.
</para>
<para>
The filesystem also contains a <filename>stats</filename> directory with
a number of useful counters for various OProfile events.
</para>
</sect2>

<sect2 id="driver">
<title>Generic kernel driver</title>
<para>
This lives in <filename>drivers/oprofile/</filename>, and forms the core of
how OProfile works in the kernel. Its job is to take samples delivered
from the architecture-specific code (via <function>oprofile_add_sample()</function>),
and buffer this data, in a transformed form as described later, until releasing
the data to the userspace daemon via the <filename>/dev/oprofile/buffer</filename>
character device.
</para>
</sect2>

<sect2 id="daemon">
<title>The OProfile daemon</title>
<para>
The OProfile userspace daemon's job is to take the raw data provided by the
kernel and write it to disk. It takes the single data stream from the
kernel and logs sample data against a number of sample files (found in
<filename>$SESSION_DIR/samples/current/</filename>, by default located at
<filename>/var/lib/oprofile/samples/current/</filename>). For the benefit
of the "separate" functionality, the names/paths of these sample files
are mangled to reflect where the samples were from: this can include
thread IDs, the binary file path, the event type used, and more.
</para>
<para>
After this final step from interrupt to disk file, the data is now
persistent (that is, changes in the running of the system do not invalidate
stored data). So the post-profiling tools can run on this data at any
time (assuming the original binary files are still available and unchanged,
naturally).
</para>
</sect2>

<sect2 id="post-profiling">
<title>Post-profiling tools</title>
<para>
So far, we've collected data, but we've yet to present it in a useful form
to the user. This is the job of the post-profiling tools. In general form,
they collate a subset of the available sample files, load and process each one
correlated against the relevant binary file, and finally produce user-readable
information.
</para>
</sect2>

</sect1>

</chapter>

<chapter id="performance-counters">
<title>Performance counter management</title>

<sect1 id="performance-counters-ui">
<title>Providing a user interface</title>

<para>
The performance counter registers need programming in order to set the
type of event to count, etc. OProfile uses a standard model across all
CPUs for defining these events as follows:
</para>
<informaltable frame="all">
<tgroup cols='2'>
<tbody>
<row><entry><option>event</option></entry><entry>The event type, e.g. <constant>DATA_MEM_REFS</constant></entry></row>
<row><entry><option>unit mask</option></entry><entry>The sub-events to count (a more detailed specification)</entry></row>
<row><entry><option>counter</option></entry><entry>The hardware counter(s) that can count this event</entry></row>
<row><entry><option>count</option></entry><entry>The reset value (how many events before an interrupt)</entry></row>
<row><entry><option>kernel</option></entry><entry>Whether the counter should increment when in kernel space</entry></row>
<row><entry><option>user</option></entry><entry>Whether the counter should increment when in user space</entry></row>
</tbody>
</tgroup>
</informaltable>
<para>
The term "unit mask" is borrowed from the Intel architectures, and can
further specify exactly when a counter is incremented (for example,
cache-related events can be restricted to particular state transitions
of the cache lines).
</para>
<para>
All of the available hardware events and their details are specified in
the textual files in the <filename>events</filename> directory. The
syntax of these files should be fairly obvious. The user specifies the
names and configuration details of the chosen counters via
<command>opcontrol</command>. These are then written to the kernel
module (in numerical form) via <filename>/dev/oprofile/N/</filename>,
where N is the physical hardware counter (some events can only be used
on specific counters; OProfile hides these details from the user when
possible). On IA64, the perfmon-based interface behaves somewhat
differently, as described later.
</para>
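<para>
For example, on x86, each numbered oprofilefs directory holds one small file
per configuration value. A hypothetical manual session for counter 0 might
look like the following (the event number shown is illustrative; in practice
<command>opcontrol</command> performs these writes for you):
</para>
<screen>
# ls /dev/oprofile/0
count  enabled  event  kernel  unit_mask  user
# echo 67 > /dev/oprofile/0/event       # numeric event type
# echo 100000 > /dev/oprofile/0/count   # interrupt after 100,000 events
# echo 0 > /dev/oprofile/0/unit_mask
# echo 1 > /dev/oprofile/0/kernel       # count in kernel space
# echo 1 > /dev/oprofile/0/user         # count in user space
# echo 1 > /dev/oprofile/0/enabled
</screen>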
</sect1>

<sect1 id="performance-counters-programming">
<title>Programming the performance counter registers</title>

<para>
We have described how the user interface fills in the desired
configuration of the counters and transmits the information to the
kernel. It is the job of the <function>->setup()</function> method
to actually program the performance counter registers. Clearly, the
details of how this is done are architecture-specific; it is also
model-specific on many architectures. For example, i386 provides methods
for each model type that program the counter registers correctly
(see the <filename>op_model_*</filename> files in
<filename>arch/i386/oprofile</filename> for the details). The method
reads the values stored in the virtual oprofilefs files and programs
the registers appropriately, ready for starting the actual profiling
session.
</para>
<para>
The architecture-specific drivers make sure to save the old register
settings before doing OProfile setup. They are restored when OProfile
shuts down. This is useful, for example, on i386, where the NMI watchdog
uses the same performance counter registers as OProfile; they cannot
run concurrently, but OProfile makes sure to restore the setup it found
before it was running.
</para>
<para>
In addition to programming the counter registers themselves, other setup
is often necessary. For example, on i386, the local APIC needs
programming in order to make the counter's overflow interrupt appear as
an NMI (non-maskable interrupt). This allows sampling (and therefore
profiling) of regions where "normal" interrupts are masked, enabling
more reliable profiles.
</para>

<sect2 id="performance-counters-start">
<title>Starting and stopping the counters</title>
<para>
Initiating a profiling session is done by writing an ASCII '1'
to the file <filename>/dev/oprofile/enable</filename>. This sets up the
core, and calls into the architecture-specific driver to actually
enable each configured counter. Again, the details of how this is
done are model-specific (for example, the Athlon models can disable
or enable on a per-counter basis, unlike the PPro models).
</para>
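<para>
In other words, from userspace the whole start/stop sequence reduces to two
writes (writing a '0' ends the session):
</para>
<screen>
# echo 1 > /dev/oprofile/enable    # calls into ->start() for each configured counter
# echo 0 > /dev/oprofile/enable    # calls into ->stop()
</screen>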
</sect2>

<sect2>
<title>IA64 and perfmon</title>
<para>
The IA64 architecture provides a different interface from the other
architectures, using the existing perfmon driver. Register programming
is handled entirely in user-space (see
<filename>daemon/opd_perfmon.c</filename> for the details). A process
is forked for each CPU, which creates a perfmon context and sets the
counter registers appropriately via the
<function>sys_perfmonctl</function> interface. In addition, the actual
initiation and termination of the profiling session is handled via the
same interface using <constant>PFM_START</constant> and
<constant>PFM_STOP</constant>. On IA64, then, there are no oprofilefs
files for the performance counters, as the kernel driver does not
program the registers itself.
</para>
<para>
Instead, the perfmon driver for OProfile simply registers with the
perfmon core, using an OProfile-specific UUID. During a profiling
session, the perfmon core calls into the OProfile perfmon driver and
samples are registered with the OProfile core itself as usual (with
<function>oprofile_add_sample()</function>).
</para>
</sect2>

</sect1>

</chapter>

<chapter id="collecting-samples">
<title>Collecting and processing samples</title>

<sect1 id="receiving-interrupts">
<title>Receiving interrupts</title>
<para>
Naturally, how the overflow interrupts are received is specific
to the hardware architecture, unless we are in "timer" mode, where the
logging routine is called directly from the standard kernel timer
interrupt handler.
</para>
<para>
On the i386 architecture, the local APIC is programmed such that when a
counter overflows (that is, it receives an event that causes the register
value to wrap around to zero), an NMI is generated. This calls
into the general handler <function>do_nmi()</function>; because OProfile
has registered itself as capable of handling NMI interrupts, this will
call into the OProfile driver code in
<filename>arch/i386/oprofile</filename>. Here, the saved PC value (the
CPU saves the register set at the time of interrupt on the stack,
available for inspection) is extracted, and the counters are examined to
find out which one generated the interrupt. Also determined is whether
the system was inside kernel or user space at the time of the interrupt.
These three pieces of information are then forwarded onto the OProfile
core via <function>oprofile_add_sample()</function>. Finally, the
counter values are reset to the chosen count value, to ensure another
interrupt happens after another N events have occurred. Other
architectures behave in a similar manner.
</para>
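<para>
The per-model overflow check has roughly the following shape. This is a
simplified sketch modelled on the i386 <filename>op_model_*</filename>
drivers; the <varname>CTR_*</varname> macro names and the exact
<function>oprofile_add_sample()</function> signature are illustrative:
</para>
<screen>
/* Simplified sketch of a model driver's overflow check, called from
 * the NMI handler with the saved register set. */
static int check_ctrs(struct pt_regs * const regs,
                      struct op_msrs const * const msrs)
{
	unsigned int low, high;
	int i;

	for (i = 0; i < NUM_COUNTERS; ++i) {
		CTR_READ(low, high, msrs, i);   /* read the counter MSR */
		if (CTR_OVERFLOWED(low)) {
			/* hand the saved PC and counter number to the core */
			oprofile_add_sample(regs, i);
			/* re-arm: count back up from the reset value to overflow */
			CTR_WRITE(reset_value[i], msrs, i);
		}
	}
	return 1;
}
</screen>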
</sect1>

<sect1 id="core-structure">
<title>Core data structures</title>
<para>
Before considering what happens when we log a sample, we shall digress
for a moment and look at the general structure of the data collection
system.
</para>
<para>
OProfile maintains a small buffer for storing the logged samples for
each CPU on the system. Only this buffer is altered when we actually log
a sample (remember, we may still be in an NMI context, so no locking is
possible). The buffer is managed by a two-handed system; the "head"
iterator dictates where the next sample data should be placed in the
buffer. Of course, overflow of the buffer is possible, in which case
the sample is discarded.
</para>
<para>
It is critical to remember that at this point, the PC value is an
absolute value, and is therefore only meaningful in the context of which
task it was logged against. Thus, these per-CPU buffers also maintain
details of which task each logged sample is for, as described in the
next section. In addition, we store whether the sample was in kernel
space or user space (on some architectures and configurations, the address
space is not sub-divided neatly at a specific PC value, so we must store
this information).
</para>
<para>
As well as these small per-CPU buffers, we have a considerably larger
single buffer. This holds the data that is eventually copied out into
the OProfile daemon. On certain system events, the per-CPU buffers are
processed and entered (in mutated form) into the main buffer, known in
the source as the "event buffer". The "tail" iterator indicates the
point from which the CPU buffer may be read, up to the position of the
"head" iterator. This provides an entirely lock-free method for extracting data
from the CPU buffers. This process is described in detail later in this chapter.
</para>
<figure><title>The OProfile buffers</title>
<graphic fileref="buffers.png" />
</figure>
</sect1>

<sect1 id="logging-sample">
<title>Logging a sample</title>
<para>
As mentioned, the sample is logged into the buffer specific to the
current CPU. The CPU buffer is a simple array of pairs of unsigned long
values; for a sample, they hold the PC value and the counter for the
sample. (The counter value is later used to translate back into the relevant
event type the counter was programmed to count.)
</para>
<para>
In addition to logging the sample itself, we also log task switches.
This is simply done by storing the address of the last task to log a
sample on that CPU in a data structure, and writing a task switch entry
into the buffer if the new value of <function>current()</function> has
changed. Note that later we will directly dereference this pointer;
this imposes certain restrictions on when and how the CPU buffers need
to be processed.
</para>
<para>
Finally, as mentioned, we log whether we have changed between kernel and
userspace using a similar method. Both of these variables
(<varname>last_task</varname> and <varname>last_is_kernel</varname>) are
reset when the CPU buffer is read.
</para>
</sect1>

<sect1 id="logging-stack">
<title>Logging stack traces</title>
<para>
OProfile can also provide statistical samples of call chains (on x86). To
do this, at sample time, the frame pointer chain is traversed, recording
the return address for each stack frame. This will only work if the code
was compiled with frame pointers, but we're careful to abort the
traversal if the frame pointer appears bad. We store the set of return
addresses straight into the CPU buffer. Note that, since this traversal
is keyed off the standard sample interrupt, the number of times a
function appears in a stack trace is not an indicator of how many times
the call site was executed: rather, it's related to the number of
samples we took where that call site was involved. Thus, the results for
stack traces are not necessarily proportional to the call counts:
typical programs will have many <function>main()</function> samples.
</para>
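<para>
The traversal itself relies on the conventional i386 frame layout, where each
frame begins with the saved frame pointer followed by the return address. A
minimal sketch, in which <function>valid_frame()</function> stands in for the
sanity checks that abort the walk on a bad-looking frame pointer:
</para>
<screen>
struct frame {
	struct frame *next;      /* saved frame pointer of the caller */
	unsigned long ret_addr;  /* return address into the caller */
};

static void trace_stack(struct frame *head, unsigned int depth)
{
	while (depth--) {
		if (!valid_frame(head))
			break;  /* abort: frame pointer looks bad */
		oprofile_add_trace(head->ret_addr); /* store into the CPU buffer */
		head = head->next;                  /* walk towards the stack base */
	}
}
</screen>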
</sect1>

<sect1 id="synchronising-buffers">
<title>Synchronising the CPU buffers to the event buffer</title>
<!-- FIXME: update when percpu patch goes in -->
<para>
At some point, we have to process the data in each CPU buffer and enter
it into the main (event) buffer. The file
<filename>buffer_sync.c</filename> contains the relevant code. We
periodically (currently every <constant>HZ</constant>/4 jiffies) start
the synchronisation process. In addition, we process the buffers on
certain events, such as an application calling
<function>munmap()</function>. This is particularly important for
<function>exit()</function> - because the CPU buffers contain pointers
to the task structure, if we don't process all the buffers before the
task is actually destroyed and the task structure freed, then we could
end up trying to dereference a bogus pointer in one of the CPU buffers.
</para>
<para>
We also add a notification when a kernel module is loaded; this is so
that user-space can re-read <filename>/proc/modules</filename> to
determine the load addresses of kernel module text sections. Without
this notification, samples for a newly-loaded module could get lost or
be attributed to the wrong module.
</para>
<para>
The synchronisation itself works in the following manner: first, mutual
exclusion on the event buffer is taken. Remember, we do not need to do
that for each CPU buffer, as we only read from the tail iterator (whilst
interrupts might be arriving at the same buffer, they will write to
the position of the head iterator, leaving previously written entries
intact). Then, we process each CPU buffer in turn. A CPU switch
notification is added to the buffer first (for
<option>--separate=cpu</option> support). Then the processing of the
actual data starts.
</para>
<para>
As mentioned, the CPU buffer consists of task switch entries and the
actual samples. When the routine <function>sync_buffer()</function> sees
a task switch, the process ID and process group ID are recorded into the
event buffer, along with a dcookie (see below) identifying the
application binary (e.g. <filename>/bin/bash</filename>). The
<varname>mmap_sem</varname> for the task is then taken, to allow safe
iteration across the task's list of mapped areas. Each sample is then
processed as described in the next section.
</para>
<para>
After a buffer has been read, the tail iterator is updated to reflect
how much of the buffer was processed. Note that when we determined how
much data there was to read in the CPU buffer, we also called
<function>cpu_buffer_reset()</function> to reset
<varname>last_task</varname> and <varname>last_is_kernel</varname>, as
we've already mentioned. During the processing, more samples may have
been arriving in the CPU buffer; this is OK because we are careful to
only update the tail iterator to reflect how much we actually read - on
the next buffer synchronisation, we will start again from that point.
</para>
</sect1>

<sect1 id="dentry-cookies">
<title>Identifying binary images</title>
<para>
In order to produce useful profiles, we need to be able to associate a
particular PC value sample with an actual ELF binary on the disk. This
leaves us with the problem of how to export this information to
user-space. We create unique IDs that identify a particular directory
entry (dentry), and write those IDs into the event buffer. Later on,
the user-space daemon can call the <function>lookup_dcookie</function>
system call, which looks up the ID and fills in the full path of
the binary image in the buffer user-space passes in. These IDs are
maintained by the code in <filename>fs/dcookies.c</filename>; the
cache lasts for as long as the daemon has the event buffer open.
</para>
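<para>
There is no glibc wrapper for <function>lookup_dcookie</function>, so the
daemon issues the raw system call. A sketch of the userspace side, assuming a
64-bit ABI (on 32-bit ABIs the 64-bit cookie is split across two argument
registers, which the real code in <filename>daemon/opd_cookie.c</filename>
has to handle):
</para>
<screen>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

static int print_image_path(uint64_t cookie)
{
	char buf[PATH_MAX];
	/* returns the length of the path placed in buf, or negative on error */
	long len = syscall(SYS_lookup_dcookie, cookie, buf, sizeof(buf));
	if (len < 0)
		return -1;
	printf("binary image: %.*s\n", (int)len, buf);
	return 0;
}
</screen>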
</sect1>

<sect1 id="finding-dentry">
<title>Finding a sample's binary image and offset</title>
<para>
We haven't yet described how we process the absolute PC value into
something usable by the user-space daemon. When we find a sample entered
into the CPU buffer, we traverse the list of mappings for the task
(remember, we will have seen a task switch earlier, so we know which
task's list to look at). When a mapping is found that contains the PC
value, we look up the mapped file's dentry in the dcookie cache. This
gives the dcookie ID that will uniquely identify the mapped file. Then
we alter the absolute value such that it is an offset from the start of
the file being mapped (the mapping need not start at the start of the
actual file, so we have to consider the offset value of the mapping). We
store this dcookie ID into the event buffer; this identifies which
binary the samples following it are against.
In this manner, we have converted a PC value, which has transitory
meaning only, into a static offset value for later processing by the
daemon.
</para>
<para>
We also attempt to avoid the relatively expensive lookup of the dentry
cookie value by storing the cookie value directly into the dentry
itself; then we can simply derive the cookie value immediately when we
find the correct mapping.
</para>
</sect1>

</chapter>

<chapter id="sample-files">
<title>Generating sample files</title>

<sect1 id="processing-buffer">
<title>Processing the buffer</title>

<para>
Now we can move on to user-space in our description of how raw interrupt
samples are processed into useful information. As we described in
previous sections, the kernel OProfile driver creates a large buffer of
sample data consisting of offset values, interspersed with
notifications of changes in context. These context changes indicate how
the following samples should be attributed, and include task switches, CPU
changes, and which dcookie the sample value is against. By processing
this buffer entry-by-entry, we can determine where the samples should
be attributed. This is particularly important when using the
<option>--separate</option> option.
</para>
<para>
The file <filename>daemon/opd_trans.c</filename> contains the basic routine
for the buffer processing. The <varname>struct transient</varname>
structure is used to hold changes in context. Its members are modified
as we process each entry; it is passed into the routines in
<filename>daemon/opd_sfile.c</filename> for actually logging the sample
to a particular sample file (which will be held in
<filename>$SESSION_DIR/samples/current</filename>).
</para>
<para>
The buffer format is designed for conciseness, as high sampling rates
can easily generate a lot of data. Thus, context changes are prefixed
by an escape code, identified by <function>is_escape_code()</function>.
If an escape code is found, the next entry in the buffer identifies
what type of context change is being read. These are handed off to
various handlers (see the <varname>handlers</varname> array), which
modify the transient structure as appropriate. If it's not an escape
code, then it must be a PC offset value, and the very next entry will
be the numeric hardware counter. These values are read and recorded
in the transient structure; we then do a lookup to find the correct
sample file, and log the sample, as described in the next section.
</para>
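<para>
Stripped of error handling and statistics, the processing loop has the
following shape (a simplified sketch of the logic in
<filename>daemon/opd_trans.c</filename>; the helper names are approximate):
</para>
<screen>
/* Simplified shape of the daemon's buffer-processing loop. */
static void process_buffer(struct transient *trans)
{
	while (trans->remaining > 1) {
		unsigned long code = pop_buffer_value(trans);
		if (is_escape_code(code)) {
			/* the next word identifies the context change */
			unsigned long type = pop_buffer_value(trans);
			handlers[type](trans);  /* e.g. task switch, CPU switch */
		} else {
			trans->pc = code;                       /* PC offset value */
			trans->event = pop_buffer_value(trans); /* hardware counter */
			sfile_log_sample(trans); /* find the sample file and log */
		}
	}
}
</screen>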
<sect2 id="handling-kernel-samples">
<title>Handling kernel samples</title>

<para>
Samples from kernel code require a little special handling. Because
the binary text which the sample is against does not correspond to
any file that the kernel directly knows about, the OProfile driver
stores the absolute PC value in the buffer, instead of the file offset.
Of course, we need an offset against some particular binary. To handle
this, we keep a list of loaded modules by parsing
<filename>/proc/modules</filename> as needed. When a module is loaded,
a notification is placed in the OProfile buffer, and this triggers a
re-read. We store the module name, and the loading address and size.
This is also done for the main kernel image, as specified by the user.
The absolute PC value is matched against each address range, and
modified into an offset when the matching module is found. See
<filename>daemon/opd_kernel.c</filename> for the details.
</para>

</sect2>

</sect1>

<sect1 id="sample-file-generation">
<title>Locating and creating sample files</title>

<para>
We have a sample value and its satellite data stored in a
<varname>struct transient</varname>, and we must locate an
actual sample file to store the sample in, using the context
information in the transient structure as a key. The transient data to
sample file lookup is handled in
<filename>daemon/opd_sfile.c</filename>. A hash is taken of the
transient values that are relevant (depending upon the setting of
<option>--separate</option>, some values might be irrelevant), and the
hash value is used to look up the list of currently open sample files.
Of course, the sample file might not be found, in which case we need
to create and open it.
</para>
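<para>
The hash might look roughly as follows. This is purely illustrative - the
real <function>sfile_hash()</function> in <filename>daemon/opd_sfile.c</filename>
mixes the values differently - but it shows how the
<option>--separate</option> settings decide which transient fields
participate in the key:
</para>
<screen>
/* Illustrative sketch only: hash the relevant context values. */
static unsigned long sfile_hash(struct transient const *trans)
{
	unsigned long val = trans->cookie;  /* the binary image's dcookie */

	val ^= trans->event;                /* the hardware event */
	if (separate_thread)
		val ^= trans->tid ^ trans->tgid;
	if (separate_cpu)
		val ^= trans->cpu;

	return val % HASH_SIZE;             /* index into the open-sfile table */
}
</screen>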
<para>
OProfile uses a rather complex scheme for naming sample files, in order
to make selecting relevant sample files easier for the post-profiling
utilities. The exact details of the scheme are given in
<filename>oprofile-tests/pp_interface</filename>, but for now it will
suffice to remember that the filename will include only the information
relevant to the current settings, taken from the transient data. A
fully-specified filename looks something like:
</para>
<computeroutput>
/var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0
</computeroutput>
<para>
It should be clear that this identifies such information as the
application binary, the dependent (library) binary, the hardware event,
and the process and thread ID. Typically, not all this information is
needed, in which case some values may be replaced with the token
<filename>all</filename>.
</para>
<para>
The code that generates this filename and opens the file is found in
<filename>daemon/opd_mangling.c</filename>. You may have realised that
at this point, we do not have the binary image file names, only the
dcookie values. In order to determine a file name, a dcookie value is
looked up in the dcookie cache. This is to be found in
<filename>daemon/opd_cookie.c</filename>. Since dcookies are both
persistent and unique during a sampling session, we can cache the
values. If the value is not found in the cache, then we ask the kernel
to do the lookup from value to file name for us by calling
<function>lookup_dcookie()</function>. This looks up the value in a
kernel-side cache (see <filename>fs/dcookies.c</filename>) and returns
the fully-qualified file name to userspace.
</para>

</sect1>

<sect1 id="sample-file-writing">
<title>Writing data to a sample file</title>

<para>
Each specific sample file is a hashed collection, where the key is
the PC offset from the transient data, and the value is the number of
samples recorded against that offset. The files are
<function>mmap()</function>ed into the daemon's memory space. The code
to actually log the write against the sample file can be found in
<filename>libdb/</filename>.
</para>
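<para>
In outline, logging one sample is a single key/value update against the open
<varname>odb_t</varname>. A sketch using the libdb interface (treat the exact
names and return conventions as approximate; see
<filename>libdb/odb.h</filename>):
</para>
<screen>
#include <stdio.h>
#include <string.h>
#include "odb.h"   /* libdb interface */

/* Sketch: bump the sample count stored against a PC offset.
 * odb_update_node() creates the node if the key is absent,
 * otherwise it increments the stored count. */
static void log_offset(odb_t *file, unsigned long offset)
{
	int err = odb_update_node(file, (odb_key_t)offset);
	if (err)
		fprintf(stderr, "sample write failed: %s\n", strerror(err));
}
</screen>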
<para>
For recording stack traces, we have a more complicated sample filename
mangling scheme that allows us to identify cross-binary calls. We use
the same sample file format, where the key is a 64-bit value composed
from the from,to pair of offsets.
</para>

</sect1>

</chapter>

<chapter id="output">
<title>Generating useful output</title>

<para>
All of the tools used to generate human-readable output have to take
roughly the same steps to collect the data for processing. First, the
profile specification given by the user has to be parsed. Next, a list
of sample files matching the specification has to be obtained. Using this
list, we need to locate the binary file for each sample file, and then
use the two together to extract meaningful data, before a final collation
and presentation to the user.
</para>

<sect1 id="profile-specification">
<title>Handling the profile specification</title>

<para>
The profile specification presented by the user is parsed in
the function <function>profile_spec::create()</function>. This
creates an object representing the specification. Then we
use <function>profile_spec::generate_file_list()</function>
to search for all sample files and match them against the
<varname>profile_spec</varname>.
</para>

<para>
To enable this matching process to work, the attributes of
each sample file are encoded in its filename. This is a low-tech
approach to matching specifications against candidate sample
files, but it works reasonably well. Typical sample files
might look like these:
</para>
<screen>
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0
/var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all
</screen>
<para>
This looks unnecessarily complex, but it's actually fairly simple. First
we have the session of the sample, by default located at
<filename>/var/lib/oprofile/samples/current</filename>. This location
can be changed by specifying the <option>--session-dir</option> option
at the command line.
This session could equally well be inside an archive from <command>oparchive</command>.
Next we have one of the tokens <filename>{root}</filename> or
<filename>{kern}</filename>. <filename>{root}</filename> indicates
that the binary is found on a file system, and we will encode its path
in the next section (e.g. <filename>/bin/ls</filename>).
<filename>{kern}</filename> indicates a kernel module - on 2.6 kernels
the path information is not available from the kernel, so we have to
special-case kernel modules like this; we encode merely the name of the
module as loaded.
</para>
<para>
Next there is a <filename>{dep}</filename> token, indicating another
token/path which identifies the dependent binary image. This is used even for
the "primary" binary (i.e. the one that was
<function>execve()</function>d), as it simplifies processing. Finally,
if this sample file is a normal flat profile, the actual file is next in
the path. If it's a call-graph sample file, we need one further
specification, to allow us to identify cross-binary arcs in the call
graph.
</para>
<para>
The actual sample file name is dot-separated, where the fields are, in
order: event name, event count, unit mask, task group ID, task ID, and
CPU number.
</para>
<para>
This sample file can be reliably parsed (with
<function>parse_filename()</function>) into a
<varname>filename_spec</varname>. Finally, we can check whether to
include the sample file in the final results by comparing this
<varname>filename_spec</varname> against the
<varname>profile_spec</varname> the user specified (for the interested,
see <function>valid_candidate()</function> and
<function>profile_spec::match</function>). Then comes the really
complicated bit...
</para>

</sect1>

<sect1 id="sample-file-collating">
<title>Collating the candidate sample files</title>

<para>
At this point we have a duplicate-free list of sample files we need
to process. But first we need to do some further arrangement: we
need to classify each sample file, and we may also need to "invert"
the profiles.
</para>

<sect2 id="sample-file-classifying">
<title>Classifying sample files</title>

<para>
It's possible for utilities like <command>opreport</command> to show
data in columnar format: for example, we might want to show the results
of two threads within a process side-by-side. To do this, we need
to classify each sample file into classes - the classes correspond
to the <command>opreport</command> columns. The function that handles
this is <function>arrange_profiles()</function>. Each sample file
is added to a particular class. If the sample file is the first in
its class, a template is generated from the sample file. Each template
describes a particular class (thus, in our example above, each template
will have a different thread ID, and this uniquely identifies each
class).
</para>

<para>
Each class has a list of "profile sets" matching that class's template.
A profile set is either a profile of the primary binary image, or any of
its dependent images. After all sample files have been listed in one of
the profile sets belonging to the classes, we have to name each class and
perform error-checking. This is done by
<function>identify_classes()</function>; each class is checked to ensure
that its "axis" is the same as all the others. This is needed because
<command>opreport</command> can't produce results in 3D format: we can
only differ in one aspect, such as thread ID or event name.
</para>

</sect2>

<sect2 id="sample-file-inverting">
<title>Creating inverted profile lists</title>

<para>
Remember that if we're using certain profile separation options, such as
<option>--separate=lib</option>, a single binary could be a dependent image
to many different binaries. For example, the C library image would be a
dependent image for most programs that have been profiled. As it
happens, this can cause severe performance problems: without some
re-arrangement, these dependent binary images would be opened each
time we need to process sample files for each program.
</para>

<para>
The solution is to "invert" the profiles via
<function>invert_profiles()</function>. We create a new data structure
where the dependent binary is first, and the primary binary images using
that dependent binary are listed as sub-images. This helps our
performance problem, as now we only need to open each dependent image
once, when we process the list of inverted profiles.
</para>

</sect2>

</sect1>

<sect1 id="generating-profile-data">
<title>Generating profile data</title>

<para>
Things don't get any simpler at this point, unfortunately. We've now
collected and classified the sample files into the set of inverted
profiles, as described in the previous section. Now we need to process
each inverted profile and make something of the data. The entry point
for this is <function>populate_for_image()</function>.
</para>

<sect2 id="bfd">
<title>Processing the binary image</title>
<para>
The first thing we do with an inverted profile is attempt to open the
binary image (remember each inverted profile set is only for one binary
image, but may have many sample files to process). The
<varname>op_bfd</varname> class provides an abstracted interface to
this; internally it uses <filename>libbfd</filename>. The main purpose
of this class is to process the symbols for the binary image; this is
also where symbol filtering happens. This is actually quite tricky, but
should be clear from the source.
</para>
</sect2>

<sect2 id="processing-sample-files">
<title>Processing the sample files</title>
<para>
The class <varname>profile_container</varname> is a hold-all that
contains all the processed results. It is a container of
<varname>profile_t</varname> objects. The
<function>add_sample_files()</function> method uses
<filename>libdb</filename> to open the given sample file and add the
key/value pairs to the <varname>profile_t</varname>. Once this has been
done, <function>profile_container::add()</function> is passed the
<varname>profile_t</varname> plus the <varname>op_bfd</varname> for
processing.
</para>
<para>
<function>profile_container::add()</function> walks through the symbols
collected in the <varname>op_bfd</varname>.
<function>op_bfd::get_symbol_range()</function> gives us the start and
end of the symbol as an offset from the start of the binary image,
then we interrogate the <varname>profile_t</varname> for the relevant samples
for that offset range. We create a <varname>symbol_entry</varname>
object for this symbol and fill it in. If needed, here we also collect
debug information from the <varname>op_bfd</varname>, and possibly
record the detailed sample information (as used by <command>opreport
-d</command> and <command>opannotate</command>).
Finally the <varname>symbol_entry</varname> is added to
a private container of the <varname>profile_container</varname> - this
<varname>symbol_container</varname> holds all such processed symbols.
</para>
</sect2>

</sect1>

<sect1 id="generating-output">
<title>Generating output</title>

<para>
After the processing described in the previous section, we've now got
full details of what we need to output stored in the
<varname>profile_container</varname> on a symbol-by-symbol basis. To
produce output, we need to replay that data and format it suitably.
</para>
<para>
<command>opreport</command> first asks the
<varname>profile_container</varname> for a
<varname>symbol_collection</varname> (this is also where thresholding
happens).
This is sorted, then an
<varname>opreport_formatter</varname> is initialised.
This object initialises a set of field formatters as requested. Then
<function>opreport_formatter::output()</function> is called. This
iterates through the (sorted) <varname>symbol_collection</varname>;
for each entry, the selected fields (as set by the
<varname>format_flags</varname> options) are output by calling the
field formatters, with the <varname>symbol_entry</varname> passed in.
</para>

</sect1>

</chapter>

<chapter id="ext">
<title>Extended Feature Interface</title>

<sect1 id="ext-intro">
<title>Introduction</title>

<para>
The Extended Feature Interface is a standard callback interface
designed to allow extension of the OProfile daemon's sample processing.
Each feature defines a set of callback handlers which can be enabled or
disabled through an OProfile daemon command-line option.
This interface can be used to implement support for architecture-specific
features or features not commonly used by general OProfile users.
</para>

</sect1>

<sect1 id="ext-name-and-handlers">
<title>Feature Name and Handlers</title>

<para>
Each extended feature has an entry in the <varname>ext_feature_table</varname>
in <filename>opd_extended.cpp</filename>. Each entry contains a feature name
and a corresponding set of handlers. The feature name is a unique string, which is
used to identify a feature in the table. Each feature provides a set
of handlers, which will be executed by the OProfile daemon from predetermined
locations to perform certain tasks. At runtime, the OProfile daemon calls a feature
handler wrapper from one of the predetermined locations to check whether
an extended feature is enabled, and whether a particular handler exists.
Only the handlers of the enabled feature will be executed.
</para>
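<para>
The table has roughly the following shape (a condensed sketch; the real
definitions live in the <filename>opd_extended</filename> sources, and the
handler typedefs shown here are illustrative):
</para>
<screen>
/* Sketch of the feature table: one entry per extended feature,
 * pairing a unique name with that feature's handler set. */
struct opd_ext_handlers {
	ext_init_t        ext_init;        /* required initialiser */
	ext_print_stats_t ext_print_stats; /* optional statistics hook */
	struct opd_ext_sfile_handlers *ext_sfile; /* optional sfile hooks */
};

struct opd_ext_feature {
	const char *feature;               /* unique feature name, e.g. "ibs" */
	struct opd_ext_handlers *handlers;
};

static struct opd_ext_feature ext_feature_table[] = {
	{ "ibs", &ibs_handlers },
	{ NULL,  NULL }
};
</screen>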
</sect1>

<sect1 id="ext-enable">
<title>Enabling Features</title>

<para>
Each feature is enabled using the OProfile daemon (oprofiled) command-line
option "--ext-feature=<extended-feature-name>:[args]". The
"extended-feature-name" is used to determine the feature to be enabled.
The optional "args" is passed into the feature-specific initialization handler
(<function>ext_init</function>). Currently, only one extended feature can be
enabled at a time.
</para>

</sect1>

<sect1 id="ext-types-of-handlers">
<title>Types of Handlers</title>

<para>
Each feature is responsible for providing its own set of handlers.
The types of handler are:
</para>

<sect2 id="ext_init">
<title>ext_init Handler</title>

<para>
"ext_init" handles initialization of an extended feature. It takes the
"args" parameter which is passed in through
"oprofiled --ext-feature=<extended-feature-name>:[args]". This handler
is executed in the function <function>opd_options()</function> in the
file <filename>daemon/oprofiled.c</filename>.
</para>

<note>
<para>
The ext_init handler is required for all features.
</para>
</note>

</sect2>

<sect2 id="ext_print_stats">
<title>ext_print_stats Handler</title>

<para>
"ext_print_stats" handles the extended feature statistics report. It adds
a new section in the OProfile daemon statistics report, which is normally
output to the file
<filename>/var/lib/oprofile/samples/oprofiled.log</filename>.
This handler is executed in the function <function>opd_print_stats()</function>
in the file <filename>daemon/opd_stats.c</filename>.
</para>

</sect2>

<sect2 id="ext_sfile_handlers">
<title>ext_sfile Handler</title>

<para>
"ext_sfile" contains a set of handlers related to operations on the extended
sample files (sample files for events related to an extended feature).
These operations include <function>create_sfile()</function>,
<function>sfile_dup()</function>, <function>close_sfile()</function>,
<function>sync_sfile()</function>, and <function>get_file()</function>
as defined in <filename>daemon/opd_sfile.c</filename>.
An additional field, <varname>odb_t * ext_files</varname>, is added to the
<varname>struct sfile</varname> for storing extended sample file
information.
</para>
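<para>
The handler set itself is a bundle of function pointers mirroring the sample
file operations above. A rough sketch (the exact signatures are defined in
the <filename>opd_extended</filename> sources and may differ):
</para>
<screen>
/* Sketch: per-feature hooks into the sample-file lifecycle. */
struct opd_ext_sfile_handlers {
	int (*create)(struct sfile *sf);           /* allocate ext_files */
	int (*dup)(struct sfile *to, struct sfile *from);
	int (*close)(struct sfile *sf);            /* release ext_files */
	int (*sync)(struct sfile *sf);             /* flush to disk */
	odb_t *(*get)(struct transient const *trans, int is_cg);
};
</screen>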
</sect2>

</sect1>

<sect1 id="ext-implementation">
<title>Extended Feature Reference Implementation</title>

<sect2 id="ext-ibs">
<title>Instruction-Based Sampling (IBS)</title>

<para>
An example of an extended feature implementation can be seen by
examining the AMD Instruction-Based Sampling support.
</para>

<sect3 id="ibs-init">
<title>IBS Initialization</title>

<para>
Instruction-Based Sampling (IBS) is a new performance measurement technique
available on AMD Family 10h processors. Enabling IBS profiling is done simply
by specifying IBS performance events through the "--event=" option.
</para>

<screen>
opcontrol --event=IBS_FETCH_XXX:<count>:<um>:<kernel>:<user>
opcontrol --event=IBS_OP_XXX:<count>:<um>:<kernel>:<user>

Note: * The count and unit mask for all IBS fetch events must be the same,
        as must those for all IBS op events.
</screen>

<para>
IBS performance events are listed by <command>opcontrol --list-events</command>.
When users specify these events, opcontrol verifies them using ophelp, which
checks for the <varname>ext:ibs_fetch</varname> or <varname>ext:ibs_op</varname>
tag in the <filename>events/x86-64/family10/events</filename> file.
Then, it configures the driver interface (/dev/oprofile/ibs_fetch/... and
/dev/oprofile/ibs_op/...) and starts the OProfile daemon as follows.
</para>

<screen>
oprofiled \
    --ext-feature=ibs:\
    fetch:<IBS_FETCH_EVENT1>,<IBS_FETCH_EVENT2>,...,:<IBS fetch count>:<IBS Fetch um>|\
    op:<IBS_OP_EVENT1>,<IBS_OP_EVENT2>,...,:<IBS op count>:<IBS op um>
</screen>

<para>
Here, the OProfile daemon parses the <varname>--ext-feature</varname>
option and checks the feature name ("ibs") before calling the
initialization function to handle the string
containing IBS events, counts, and unit masks.
Then, it stores each event in the IBS virtual-counter table
(<varname>struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]</varname>) and
stores the event index in the IBS Virtual Counter Index (VCI) map
(<varname>ibs_vci_map[OP_MAX_IBS_COUNTERS]</varname>) with the IBS event
value as the map key.
</para>
</sect3>

<sect3 id="ibs-data-processing">
<title>IBS Data Processing</title>

<para>
During a profile session, the OProfile daemon identifies IBS samples in the
event buffer using the <varname>"IBS_FETCH_CODE"</varname> or
<varname>"IBS_OP_CODE"</varname>. These codes trigger the handlers
<function>code_ibs_fetch_sample()</function> or
<function>code_ibs_op_sample()</function> listed in the
<varname>handler_t handlers[]</varname> vector in
<filename>daemon/opd_trans.c</filename>. These handlers are responsible for
processing IBS samples and translating them into IBS performance events.
</para>

<para>
Unlike traditional performance events, each IBS sample can be derived into
multiple IBS performance events. For each event that the user specifies,
a combination of bits from the Model-Specific Registers (MSRs) is checked
against the bitmask defining the event. If the condition is met, the event
will then be recorded. The derivation logic is in the files
<filename>daemon/opd_ibs_macro.h</filename> and
<filename>daemon/opd_ibs_trans.[h,c]</filename>.
</para>

</sect3>

<sect3 id="ibs-sample-file">
<title>IBS Sample File</title>

<para>
Traditionally, sample file information (<varname>odb_t</varname>) is stored
in the <varname>struct sfile::odb_t file[OP_MAX_COUNTER]</varname>.
Currently, <varname>OP_MAX_COUNTER</varname> is 8 on non-alpha, and 20 on
alpha-based systems. The event index (the counter number on which the event
is configured) is used to access the corresponding entry in the array.
Unlike the traditional performance events, IBS does not use the actual
counter registers (i.e. <filename>/dev/oprofile/0,1,2,3</filename>).
Also, the number of performance events generated by IBS could be larger than
<varname>OP_MAX_COUNTER</varname> (currently up to 13 IBS-fetch and 46 IBS-op
events). Therefore IBS requires a special data structure and sfile
handlers (<varname>struct opd_ext_sfile_handlers</varname>) for managing
IBS sample files. IBS sample file information is stored in memory
allocated by the handler <function>ibs_sfile_create()</function>, and is
accessed through <varname>struct sfile::odb_t * ext_files</varname>.
</para>

</sect3>

</sect2>

</sect1>

</chapter>

<glossary id="glossary">
<title>Glossary of OProfile source concepts and types</title>

<glossentry><glossterm>application image</glossterm>
<glossdef><para>
The primary binary image used by an application.
This is derived
from the kernel and corresponds to the binary started upon running
an application: for example, <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>binary image</glossterm>
<glossdef><para>
An ELF file containing executable code: this includes kernel modules,
the kernel itself (a.k.a. <filename>vmlinux</filename>), shared libraries,
and application binaries.
</para></glossdef></glossentry>

<glossentry><glossterm>dcookie</glossterm>
<glossdef><para>
Short for "dentry cookie". A unique ID that can be looked up to provide
the full path name of a binary image.
</para></glossdef></glossentry>

<glossentry><glossterm>dependent image</glossterm>
<glossdef><para>
A binary image that is dependent upon an application, used with
per-application separation. Most commonly, shared libraries. For example,
if <filename>/bin/bash</filename> is running and we take
some samples inside the C library itself due to <command>bash</command>
calling library code, then the image <filename>/lib/libc.so</filename>
would be dependent upon <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>merging</glossterm>
<glossdef><para>
This refers to the ability to merge several distinct sample files
into one set of data at runtime, in the post-profiling tools. For example,
per-thread sample files can be merged into one set of data, because
they are compatible (i.e. the aggregation of the data is meaningful),
but it's not possible to merge sample files for two different events,
because there would be no useful meaning to the results.
</para></glossdef></glossentry>

<glossentry><glossterm>profile class</glossterm>
<glossdef><para>
A collection of profile data that has been collected under the same
class template. For example, if we're using <command>opreport</command>
to show results after profiling with two performance counters enabled,
counting <constant>DATA_MEM_REFS</constant> and <constant>CPU_CLK_UNHALTED</constant>,
there would be two profile classes, one for each event. Or if we're on
an SMP system and doing per-cpu profiling, and we request
<command>opreport</command> to show results for each CPU side-by-side,
there would be a profile class for each CPU.
</para></glossdef></glossentry>

<glossentry><glossterm>profile specification</glossterm>
<glossdef><para>
The parameters the user passes to the post-profiling tools that limit
what sample files are used. This specification is matched against
the available sample files to generate a selection of profile data.
</para></glossdef></glossentry>

<glossentry><glossterm>profile template</glossterm>
<glossdef><para>
The parameters that define what goes in a particular profile class.
This includes a symbolic name (e.g. "cpu:1") and the code-usable
equivalent.
</para></glossdef></glossentry>

</glossary>

</book>