OProfile Internals John Levon
levon@movementarian.org
2003 John Levon
Introduction This document is current for OProfile version . This document provides some details on the internal workings of OProfile for the interested hacker. This document assumes strong C, working C++, plus some knowledge of kernel internals and CPU hardware. Only the "new" implementation associated with kernel 2.6 and above is covered here. 2.4 uses a very different kernel module implementation and daemon to produce the sample files. Overview OProfile is a statistical continuous profiler. In other words, profiles are generated by regularly sampling the current registers on each CPU (from an interrupt handler, the saved PC value at the time of interrupt is stored), and converting that runtime PC value into something meaningful to the programmer. OProfile achieves this by taking the stream of sampled PC values, along with the detail of which task was running at the time of the interrupt, and converting into a file offset against a particular binary file. Because applications mmap() the code they run (be it /bin/bash, /lib/libfoo.so or whatever), it's possible to find the relevant binary file and offset by walking the task's list of mapped memory areas. Each PC value is thus converted into a tuple of binary-image,offset. This is something that the userspace tools can use directly to reconstruct where the code came from, including the particular assembly instructions, symbol, and source line (via the binary's debug information if present). Regularly sampling the PC value like this approximates what actually was executed and how often - more often than not, this statistical approximation is good enough to reflect reality. In common operation, the time between each sample interrupt is regulated by a fixed number of clock cycles. This implies that the results will reflect where the CPU is spending the most time; this is obviously a very useful information source for performance analysis. 
Sometimes though, an application programmer needs different kinds of information: for example, "which of the source routines cause the most cache misses?". The rise in importance of such metrics in recent years has led many CPU manufacturers to provide hardware performance counters capable of measuring these events at the hardware level. Typically, these counters increment once per event, and generate an interrupt on reaching some pre-defined number of events. OProfile can use these interrupts to generate samples: then, the profile results are a statistical approximation of which code caused how many of the given event. Consider a simplified system that only executes two functions A and B. A takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at 100 cycles a second, and we've set the performance counter to create an interrupt after a set number of "events" (in this case an event is one clock cycle). It should be clear that the chance of the interrupt occurring in function A is 1/100, and 99/100 for function B. Thus, we statistically approximate the actual relative performance features of the two functions over time. This same analysis works for other types of events, provided that the interrupt is tied to the number of events occurring (that is, after N events, an interrupt is generated). There is typically more than one of these counters, so it's possible to set up profiling for several different event types. Using these counters gives us a powerful, low-overhead way of gaining performance metrics. If OProfile, or the CPU, does not support performance counters, then a simpler method is used: the kernel timer interrupt feeds samples into OProfile itself. The rest of this document concerns itself with how we get from receiving samples at interrupt time to producing user-readable profile information.
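The two-function example above (A takes one cycle, B takes 99) can be modelled in a few lines. This is only a toy sketch in Python, not OProfile code: the function name sample_profile and its parameters are hypothetical, and the "CPU" is assumed to loop over the functions in order forever.

```python
def sample_profile(durations, interval, n_samples):
    """Count which function is running at each periodic sample point.

    durations: dict mapping function name to its length in cycles; the
               toy CPU runs the functions in dict order, over and over.
    interval:  number of cycles between sample interrupts.
    """
    period = sum(durations.values())
    counts = {name: 0 for name in durations}
    for k in range(n_samples):
        # Position of the k-th interrupt within one pass of the loop.
        offset = (k * interval) % period
        for name, length in durations.items():
            if offset < length:
                counts[name] += 1
                break
            offset -= length
    return counts
```

With an interval of 7 cycles (chosen because it shares no factor with the 100-cycle loop), 100 samples land on every cycle of the loop exactly once, so sample_profile({"A": 1, "B": 99}, 7, 100) attributes one sample to A and 99 to B, matching the 1/100 and 99/100 chances given above.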
Components of the OProfile system Architecture-specific components If OProfile supports the hardware performance counters found on a particular architecture, the code for setting up and managing these counters can be found in the kernel source tree in the relevant arch/<arch>/oprofile/ directory. The architecture-specific implementation works by filling in the oprofile_operations structure at init time. This provides a set of operations such as setup(), start(), stop(), etc. that manage the hardware-specific details of fiddling with the performance counter registers. The other important facility available to the architecture code is oprofile_add_sample(). This is where a particular sample taken at interrupt time is fed into the generic OProfile driver code. oprofilefs OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from userspace at /dev/oprofile. This consists of small files for reporting and receiving configuration from userspace, as well as the actual character device that the OProfile userspace receives samples from. At setup() time, the architecture-specific code may add further configuration files related to the details of the performance counters. For example, on x86, one numbered directory for each hardware performance counter is added, with files in each for the event type, reset value, etc. The filesystem also contains a stats directory with a number of useful counters for various OProfile events. Generic kernel driver This lives in drivers/oprofile/, and forms the core of how OProfile works in the kernel. Its job is to take samples delivered from the architecture-specific code (via oprofile_add_sample()), and buffer this data, in a transformed form as described later, until releasing the data to the userspace daemon via the /dev/oprofile/buffer character device. The OProfile daemon The OProfile userspace daemon's job is to take the raw data provided by the kernel and write it to the disk.
It takes the single data stream from the kernel and logs sample data against a number of sample files (found in $SESSION_DIR/samples/current/, by default located at /var/lib/oprofile/samples/current/). For the benefit of the "separate" functionality, the names/paths of these sample files are mangled to reflect where the samples were from: this can include thread IDs, the binary file path, the event type used, and more. After this final step from interrupt to disk file, the data is now persistent (that is, changes in the running of the system do not invalidate stored data). So the post-profiling tools can run on this data at any time (assuming the original binary files are still available and unchanged, naturally). Post-profiling tools So far, we've collected data, but we've yet to present it in a useful form to the user. This is the job of the post-profiling tools. In general form, they collate a subset of the available sample files, load and process each one correlated against the relevant binary file, and finally produce user-readable information. Performance counter management Providing a user interface The performance counter registers need programming in order to set the type of event to count, etc. OProfile uses a standard model across all CPUs for defining these events, as follows:
The event type, e.g. DATA_MEM_REFS
The sub-events to count (the unit mask, a more detailed specification)
The hardware counter(s) that can count this event
The reset value (how many events before an interrupt)
Whether the counter should increment when in kernel space
Whether the counter should increment when in user space
The term "unit mask" is borrowed from the Intel architectures, and can further specify exactly when a counter is incremented (for example, cache-related events can be restricted to particular state transitions of the cache lines). All of the available hardware events and their details are specified in the textual files in the events directory.
The syntax of these files should be fairly obvious. The user specifies the names and configuration details of the chosen counters via opcontrol. These are then written to the kernel module (in numerical form) via /dev/oprofile/N/ where N is the physical hardware counter (some events can only be used on specific counters; OProfile hides these details from the user when possible). On IA64, the perfmon-based interface behaves somewhat differently, as described later. Programming the performance counter registers We have described how the user interface fills in the desired configuration of the counters and transmits the information to the kernel. It is the job of the ->setup() method to actually program the performance counter registers. Clearly, the details of how this is done are architecture-specific; they are also model-specific on many architectures. For example, i386 provides methods for each model type that program the counter registers correctly (see the op_model_* files in arch/i386/oprofile for the details). The method reads the values stored in the virtual oprofilefs files and programs the registers appropriately, ready for starting the actual profiling session. The architecture-specific drivers make sure to save the old register settings before doing OProfile setup. They are restored when OProfile shuts down. This is useful, for example, on i386, where the NMI watchdog uses the same performance counter registers as OProfile; they cannot run concurrently, but OProfile makes sure to restore the setup it found before it was running. In addition to programming the counter registers themselves, other setup is often necessary. For example, on i386, the local APIC needs programming in order to make the counter's overflow interrupt appear as an NMI (non-maskable interrupt). This allows sampling (and therefore profiling) of regions where "normal" interrupts are masked, enabling more reliable profiles.
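To make the role of the reset value (how many events before an interrupt) concrete, here is a toy Python model of one counter. It is a sketch under stated assumptions, not the driver's code: the class PerfCounter, its width parameter, and the choice to load the register so that it wraps to zero after reset_value further events are all illustrative.

```python
class PerfCounter:
    """Toy event counter that raises an 'interrupt' when it wraps to zero."""

    def __init__(self, reset_value, width=32):
        self.modulus = 1 << width       # register wraps at 2**width
        self.reset_value = reset_value  # events between interrupts
        self.interrupts = 0
        self._reload()

    def _reload(self):
        # Assumption: load the register so that exactly reset_value
        # increments bring it back round to zero.
        self.value = (self.modulus - self.reset_value) % self.modulus

    def event(self):
        """Count one event; return True if this event caused an interrupt."""
        self.value = (self.value + 1) % self.modulus
        if self.value == 0:
            self.interrupts += 1
            self._reload()  # re-arm for the next reset_value events
            return True
        return False
```

Feeding 10,000 events to PerfCounter(1000, width=16) yields ten interrupts, one per 1,000 events; a smaller reset value gives more samples at the cost of more interrupt overhead.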
Starting and stopping the counters Initiating a profiling session is done via writing an ASCII '1' to the file /dev/oprofile/enable. This sets up the core, and calls into the architecture-specific driver to actually enable each configured counter. Again, the details of how this is done are model-specific (for example, the Athlon models can disable or enable on a per-counter basis, unlike the PPro models). IA64 and perfmon The IA64 architecture provides a different interface from the other architectures, using the existing perfmon driver. Register programming is handled entirely in user-space (see daemon/opd_perfmon.c for the details). A process is forked for each CPU, which creates a perfmon context and sets the counter registers appropriately via the sys_perfmonctl interface. In addition, the actual initiation and termination of the profiling session is handled via the same interface using PFM_START and PFM_STOP. On IA64, then, there are no oprofilefs files for the performance counters, as the kernel driver does not program the registers itself. Instead, the perfmon driver for OProfile simply registers with the OProfile core with an OProfile-specific UUID. During a profiling session, the perfmon core calls into the OProfile perfmon driver and samples are registered with the OProfile core itself as usual (with oprofile_add_sample()). Collecting and processing samples Receiving interrupts Naturally, how the overflow interrupts are received is specific to the hardware architecture, unless we are in "timer" mode, where the logging routine is called directly from the standard kernel timer interrupt handler. On the i386 architecture, the local APIC is programmed such that when a counter overflows (that is, it receives an event that causes an integer overflow of the register value to zero), an NMI is generated.
This calls into the general handler do_nmi(); because OProfile has registered itself as capable of handling NMI interrupts, this will call into the OProfile driver code in arch/i386/oprofile. Here, the saved PC value (the CPU saves the register set at the time of interrupt on the stack available for inspection) is extracted, and the counters are examined to find out which one generated the interrupt. Also determined is whether the system was inside kernel or user space at the time of the interrupt. These three pieces of information are then forwarded onto the OProfile core via oprofile_add_sample(). Finally, the counter values are reset to the chosen count value, to ensure another interrupt happens after another N events have occurred. Other architectures behave in a similar manner. Core data structures Before considering what happens when we log a sample, we shall digress for a moment and look at the general structure of the data collection system. OProfile maintains a small buffer for storing the logged samples for each CPU on the system. Only this buffer is altered when we actually log a sample (remember, we may still be in an NMI context, so no locking is possible). The buffer is managed by a two-handed system; the "head" iterator dictates where the next sample data should be placed in the buffer. Of course, overflow of the buffer is possible, in which case the sample is discarded. It is critical to remember that at this point, the PC value is an absolute value, and is therefore only meaningful in the context of which task it was logged against. Thus, these per-CPU buffers also maintain details of which task each logged sample is for, as described in the next section. In addition, we store whether the sample was in kernel space or user space (on some architectures and configurations, the address space is not sub-divided neatly at a specific PC value, so we must store this information). 
As well as these small per-CPU buffers, we have a considerably larger single buffer. This holds the data that is eventually copied out into the OProfile daemon. On certain system events, the per-CPU buffers are processed and entered (in mutated form) into the main buffer, known in the source as the "event buffer". The "tail" iterator indicates the point from which the CPU may be read, up to the position of the "head" iterator. This provides an entirely lock-free method for extracting data from the CPU buffers. This process is described in detail later in this chapter.
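The head/tail arrangement described above can be sketched as a single-producer, single-consumer ring buffer. The Python below is a toy model only: the class and method names are hypothetical, and keeping one slot empty to tell "full" from "empty" is an assumption of this sketch rather than a statement about the kernel code.

```python
class CpuBuffer:
    """Toy per-CPU sample buffer with separate head and tail iterators."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # advanced only by the writer (interrupt handler)
        self.tail = 0  # advanced only by the reader (buffer sync)
        self.lost = 0  # samples discarded because the buffer was full

    def add_sample(self, pc, counter):
        """Writer side: store one sample, or discard it on overflow."""
        next_head = (self.head + 1) % len(self.slots)
        if next_head == self.tail:
            self.lost += 1
            return False
        self.slots[self.head] = (pc, counter)
        self.head = next_head
        return True

    def sync(self):
        """Reader side: copy out everything between tail and head."""
        end = self.head  # snapshot; later writes are picked up next time
        pos = self.tail
        out = []
        while pos != end:
            out.append(self.slots[pos])
            pos = (pos + 1) % len(self.slots)
        self.tail = pos  # only now release the slots back to the writer
        return out
```

Because the writer only ever moves head and the reader only ever moves tail, neither side needs to take a lock on the other's iterator, which is the property the text relies on for logging from NMI context.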
The OProfile buffers
Logging a sample As mentioned, the sample is logged into the buffer specific to the current CPU. The CPU buffer is a simple array of pairs of unsigned long values; for a sample, they hold the PC value and the counter for the sample. (The counter value is later used to translate back into the relevant event type the counter was programmed to). In addition to logging the sample itself, we also log task switches. This is simply done by storing the address of the last task to log a sample on that CPU in a data structure, and writing a task switch entry into the buffer if the new value of current() has changed. Note that later we will directly de-reference this pointer; this imposes certain restrictions on when and how the CPU buffers need to be processed. Finally, as mentioned, we log whether we have changed between kernel and userspace using a similar method. Both of these variables (last_task and last_is_kernel) are reset when the CPU buffer is read. Logging stack traces OProfile can also provide statistical samples of call chains (on x86). To do this, at sample time, the frame pointer chain is traversed, recording the return address for each stack frame. This will only work if the code was compiled with frame pointers, but we're careful to abort the traversal if the frame pointer appears bad. We store the set of return addresses straight into the CPU buffer. Note that, since this traversal is keyed off the standard sample interrupt, the number of times a function appears in a stack trace is not an indicator of how many times the call site was executed: rather, it's related to the number of samples we took where that call site was involved. Thus, the results for stack traces are not necessarily proportional to the call counts: typical programs will have many main() samples. Synchronising the CPU buffers to the event buffer At some point, we have to process the data in each CPU buffer and enter it into the main (event) buffer. 
The file buffer_sync.c contains the relevant code. We periodically (currently every HZ/4 jiffies) start the synchronisation process. In addition, we process the buffers on certain events, such as an application calling munmap(). This is particularly important for exit() - because the CPU buffers contain pointers to the task structure, if we don't process all the buffers before the task is actually destroyed and the task structure freed, then we could end up trying to dereference a bogus pointer in one of the CPU buffers. We also add a notification when a kernel module is loaded; this is so that user-space can re-read /proc/modules to determine the load addresses of kernel module text sections. Without this notification, samples for a newly-loaded module could get lost or be attributed to the wrong module. The synchronisation itself works in the following manner: first, mutual exclusion on the event buffer is taken. Remember, we do not need to do that for each CPU buffer, as we only read from the tail iterator (interrupts might be arriving at the same buffer in the meantime, but they will write to the position of the head iterator, leaving previously written entries intact). Then, we process each CPU buffer in turn. A CPU switch notification is added to the buffer first (for support). Then the processing of the actual data starts. As mentioned, the CPU buffer consists of task switch entries and the actual samples. When the routine sync_buffer() sees a task switch, the process ID and process group ID are recorded into the event buffer, along with a dcookie (see below) identifying the application binary (e.g. /bin/bash). The mmap_sem for the task is then taken, to allow safe iteration across the task's list of mapped areas. Each sample is then processed as described in the next section. After a buffer has been read, the tail iterator is updated to reflect how much of the buffer was processed.
Note that when we determined how much data there was to read in the CPU buffer, we also called cpu_buffer_reset() to reset last_task and last_is_kernel, as we've already mentioned. During the processing, more samples may have been arriving in the CPU buffer; this is OK because we are careful to only update the tail iterator to how much we actually read - on the next buffer synchronisation, we will start again from that point. Identifying binary images In order to produce useful profiles, we need to be able to associate a particular PC value sample with an actual ELF binary on the disk. This leaves us with the problem of how to export this information to user-space. We create unique IDs that identify a particular directory entry (dentry), and write those IDs into the event buffer. Later on, the user-space daemon can call the lookup_dcookie system call, which looks up the ID and fills in the full path of the binary image in the buffer user-space passes in. These IDs are maintained by the code in fs/dcookies.c; the cache lasts for as long as the daemon has the event buffer open. Finding a sample's binary image and offset We haven't yet described how we process the absolute PC value into something usable by the user-space daemon. When we find a sample entered into the CPU buffer, we traverse the list of mappings for the task (remember, we will have seen a task switch earlier, so we know which task's lists to look at). When a mapping is found that contains the PC value, we look up the mapped file's dentry in the dcookie cache. This gives the dcookie ID that will uniquely identify the mapped file. Then we alter the absolute value such that it is an offset from the start of the file being mapped (the mapping need not start at the start of the actual file, so we have to consider the offset value of the mapping). We store this dcookie ID into the event buffer; this identifies which binary the samples following it are against. 
In this manner, we have converted a PC value, which has transitory meaning only, into a static offset value for later processing by the daemon. We also attempt to avoid the relatively expensive lookup of the dentry cookie value by storing the cookie value directly into the dentry itself; then we can simply derive the cookie value immediately when we find the correct mapping.
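The conversion from an absolute PC value to a (dcookie, offset) pair can be illustrated with a small user-space model. This is a sketch in Python, not the kernel implementation: Mapping, DcookieCache and resolve are hypothetical names, and the toy cache keys on path strings where the real dcookie code identifies directory entries.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    start: int        # first address covered by the mapping
    end: int          # one past the last address
    file_offset: int  # where in the file the mapping begins
    path: str         # stands in for the mapped file's dentry


class DcookieCache:
    """Hands out one stable ID per mapped file for the session."""

    def __init__(self):
        self._ids = {}
        self._paths = {}

    def get(self, path):
        if path not in self._ids:
            cookie = len(self._ids) + 1
            self._ids[path] = cookie
            self._paths[cookie] = path
        return self._ids[path]

    def lookup(self, cookie):
        """What the daemon later asks for: ID back to full path."""
        return self._paths[cookie]


def resolve(pc, mappings, cache):
    """Return (cookie, file offset) for pc, or None if nothing maps it."""
    for m in mappings:
        if m.start <= pc < m.end:
            # The mapping need not start at the beginning of the file,
            # so add the mapping's own offset into the file.
            return cache.get(m.path), pc - m.start + m.file_offset
    return None
```

For a library mapped at 0x7f0000002000 from file offset 0x2000, a PC of 0x7f0000002abc resolves to offset 0x2abc in that file, which stays meaningful after the task has exited.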
Generating sample files Processing the buffer Now we can move onto user-space in our description of how raw interrupt samples are processed into useful information. As we described in previous sections, the kernel OProfile driver creates a large buffer of sample data consisting of offset values, interspersed with notification of changes in context. These context changes indicate how following samples should be attributed, and include task switches, CPU changes, and which dcookie the sample value is against. By processing this buffer entry-by-entry, we can determine where the samples should be accredited to. This is particularly important when using the . The file daemon/opd_trans.c contains the basic routine for the buffer processing. The struct transient structure is used to hold changes in context. Its members are modified as we process each entry; it is passed into the routines in daemon/opd_sfile.c for actually logging the sample to a particular sample file (which will be held in $SESSION_DIR/samples/current). The buffer format is designed for conciseness, as high sampling rates can easily generate a lot of data. Thus, context changes are prefixed by an escape code, identified by is_escape_code(). If an escape code is found, the next entry in the buffer identifies what type of context change is being read. These are handed off to various handlers (see the handlers array), which modify the transient structure as appropriate. If it's not an escape code, then it must be a PC offset value, and the very next entry will be the numeric hardware counter. These values are read and recorded in the transient structure; we then do a lookup to find the correct sample file, and log the sample, as described in the next section. Handling kernel samples Samples from kernel code require a little special handling. 
Because the binary text which the sample is against does not correspond to any file that the kernel directly knows about, the OProfile driver stores the absolute PC value in the buffer, instead of the file offset. Of course, we need an offset against some particular binary. To handle this, we keep a list of loaded modules by parsing /proc/modules as needed. When a module is loaded, a notification is placed in the OProfile buffer, and this triggers a re-read. We store the module name, and the loading address and size. This is also done for the main kernel image, as specified by the user. The absolute PC value is matched against each address range, and modified into an offset when the matching module is found. See daemon/opd_kernel.c for the details. Locating and creating sample files We have a sample value and its satellite data stored in a struct transient, and we must locate an actual sample file to store the sample in, using the context information in the transient structure as a key. The transient data to sample file lookup is handled in daemon/opd_sfile.c. A hash is taken of the transient values that are relevant (depending upon the setting of , some values might be irrelevant), and the hash value is used to lookup the list of currently open sample files. Of course, the sample file might not be found, in which case we need to create and open it. OProfile uses a rather complex scheme for naming sample files, in order to make selecting relevant sample files easier for the post-profiling utilities. The exact details of the scheme are given in oprofile-tests/pp_interface, but for now it will suffice to remember that the filename will include only relevant information for the current settings, taken from the transient data. 
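The address-range matching described under "Handling kernel samples" amounts to a simple search. The following is a minimal Python sketch, assuming the module list has already been read into (name, load address, size) tuples; the function name and the example addresses are made up for illustration and are not taken from daemon/opd_kernel.c.

```python
def kernel_offset(pc, images):
    """Turn an absolute kernel PC into (image name, offset), or None.

    images: list of (name, start_address, size) tuples covering the main
            kernel image and each loaded module.
    """
    for name, start, size in images:
        if start <= pc < start + size:
            return name, pc - start
    return None


# Hypothetical layout: the main kernel image plus one loaded module.
IMAGES = [
    ("vmlinux", 0xC0100000, 0x300000),
    ("r128", 0xF8A00000, 0x18000),
]
```

With this layout, kernel_offset(0xF8A00123, IMAGES) gives ("r128", 0x123). A PC that falls in no known range returns None, which is the situation the module-load notification is meant to prevent.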
A fully-specified filename looks something like: /var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0 It should be clear that this identifies such information as the application binary, the dependent (library) binary, the hardware event, and the process and thread ID. Typically, not all this information is needed, in which case some values may be replaced with the token all. The code that generates this filename and opens the file is found in daemon/opd_mangling.c. You may have realised that at this point, we do not have the binary image file names, only the dcookie values. In order to determine a file name, a dcookie value is looked up in the dcookie cache. This is to be found in daemon/opd_cookie.c. Since dcookies are both persistent and unique during a sampling session, we can cache the values. If the value is not found in the cache, then we ask the kernel to do the lookup from value to file name for us by calling lookup_dcookie(). This looks up the value in a kernel-side cache (see fs/dcookies.c) and returns the fully-qualified file name to userspace. Writing data to a sample file Each specific sample file is a hashed collection, where the key is the PC offset from the transient data, and the value is the number of samples recorded against that offset. The files are mmap()ed into the daemon's memory space. The code to actually log the write against the sample file can be found in libdb/. For recording stack traces, we have a more complicated sample filename mangling scheme that allows us to identify cross-binary calls. We use the same sample file format, where the key is a 64-bit value composed from the from,to pair of offsets. Generating useful output All of the tools used to generate human-readable output have to take roughly the same steps to collect the data for processing. First, the profile specification given by the user has to be parsed.
Next, a list of sample files matching the specification has to be obtained. Using this list, we need to locate the binary file for each sample file, and then use them to extract meaningful data, before a final collation and presentation to the user. Handling the profile specification The profile specification presented by the user is parsed in the function profile_spec::create(). This creates an object representing the specification. Then we use profile_spec::generate_file_list() to search for all sample files and match them against the profile_spec. To enable this matching process to work, the attributes of each sample file are encoded in its filename. This is a low-tech approach to matching specifications against candidate sample files, but it works reasonably well. Typical sample files might look like these:
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0
/var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all
This looks unnecessarily complex, but it's actually fairly simple. First we have the session of the sample, by default located at /var/lib/oprofile/samples/current. This location can be changed by specifying the --session-dir option on the command line. This session could equally well be inside an archive from oparchive. Next we have one of the tokens {root} or {kern}. {root} indicates that the binary is found on a file system, and we will encode its path in the next section (e.g. /bin/ls). {kern} indicates a kernel module - on 2.6 kernels the path information is not available from the kernel, so we have to special-case kernel modules like this; we encode merely the name of the module as loaded.
Next there is a {dep} token, indicating another token/path which identifies the dependent binary image. This is used even for the "primary" binary (i.e. the one that was execve()d), as it simplifies processing. Finally, if this sample file is a normal flat profile, the actual file is next in the path. If it's a call-graph sample file, we need one further specification, to allow us to identify cross-binary arcs in the call graph. The actual sample file name is dot-separated, where the fields are, in order: event name, event count, unit mask, task group ID, task ID, and CPU number. This sample file can be reliably parsed (with parse_filename()) into a filename_spec. Finally, we can check whether to include the sample file in the final results by comparing this filename_spec against the profile_spec the user specified (for the interested, see valid_candidate() and profile_spec::match). Then comes the really complicated bit... Collating the candidate sample files At this point we have a duplicate-free list of sample files we need to process. But first we need to do some further arrangement: we need to classify each sample file, and we may also need to "invert" the profiles. Classifying sample files It's possible for utilities like opreport to show data in columnar format: for example, we might want to show the results of two threads within a process side-by-side. To do this, we need to classify each sample file into classes - the classes correspond with each opreport column. The function that handles this is arrange_profiles(). Each sample file is added to a particular class. If the sample file is the first in its class, a template is generated from the sample file. Each template describes a particular class (thus, in our example above, each template will have a different thread ID, and this uniquely identifies each class). Each class has a list of "profile sets" matching that class's template. 
A profile set is either a profile of the primary binary image, or any of its dependent images. After all sample files have been listed in one of the profile sets belonging to the classes, we have to name each class and perform error-checking. This is done by identify_classes(); each class is checked to ensure that its "axis" is the same as all the others. This is needed because opreport can't produce results in 3D format: we can only differ in one aspect, such as thread ID or event name. Creating inverted profile lists Remember that if we're using certain profile separation options, such as "--separate=lib", a single binary could be a dependent image to many different binaries. For example, the C library image would be a dependent image for most programs that have been profiled. As it happens, this can cause severe performance problems: without some re-arrangement, these dependent binary images would be opened each time we need to process sample files for each program. The solution is to "invert" the profiles via invert_profiles(). We create a new data structure where the dependent binary is first, and the primary binary images using that dependent binary are listed as sub-images. This helps our performance problem, as now we only need to open each dependent image once, when we process the list of inverted profiles. Generating profile data Things don't get any simpler at this point, unfortunately. At this point we've collected and classified the sample files into the set of inverted profiles, as described in the previous section. Now we need to process each inverted profile and make something of the data. The entry point for this is populate_for_image(). Processing the binary image The first thing we do with an inverted profile is attempt to open the binary image (remember each inverted profile set is only for one binary image, but may have many sample files to process). The op_bfd class provides an abstracted interface to this; internally it uses libbfd. 
The main purpose of this class is to process the symbols for the binary image; this is also where symbol filtering happens. This is actually quite tricky, but should be clear from the source. Processing the sample files The class profile_container is a hold-all that contains all the processed results. It is a container of profile_t objects. The add_sample_files() method uses libdb to open the given sample file and add the key/value types to the profile_t. Once this has been done, profile_container::add() is passed the profile_t plus the op_bfd for processing. profile_container::add() walks through the symbols collected in the op_bfd. op_bfd::get_symbol_range() gives us the start and end of the symbol as an offset from the start of the binary image, then we interrogate the profile_t for the relevant samples for that offset range. We create a symbol_entry object for this symbol and fill it in. If needed, here we also collect debug information from the op_bfd, and possibly record the detailed sample information (as used by opreport -d and opannotate). Finally the symbol_entry is added to a private container of profile_container - this symbol_container holds all such processed symbols. Generating output After the processing described in the previous section, we've now got full details of what we need to output stored in the profile_container on a symbol-by-symbol basis. To produce output, we need to replay that data and format it suitably. opreport first asks the profile_container for a symbol_collection (this is also where thresholding happens). This is sorted, then an opreport_formatter is initialised. This object initialises a set of field formatters as requested. Then opreport_formatter::output() is called. This iterates through the (sorted) symbol_collection; for each entry, the selected fields (as set by the format_flags options) are output by calling the field formatters, with the symbol_entry passed in.
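The symbol-by-symbol collation described above (ask for each symbol's offset range, sum the samples in that range, then threshold and sort) can be sketched in Python. This is an illustrative model only: collate and its arguments are hypothetical, and the real tools work through op_bfd, profile_t and profile_container rather than plain lists and dicts.

```python
def collate(symbols, samples, threshold=0.0):
    """Attribute per-offset sample counts to symbols.

    symbols:   list of (name, start, end) offsets into the binary image,
               with end exclusive.
    samples:   dict mapping image offset to number of samples.
    threshold: drop symbols below this percentage of all samples.
    Returns a list of (name, count), largest count first.
    """
    total = sum(samples.values())
    entries = []
    for name, start, end in symbols:
        count = sum(n for off, n in samples.items() if start <= off < end)
        if count == 0:
            continue
        if total and 100.0 * count / total < threshold:
            continue
        entries.append((name, count))
    # Stable sort keeps the symbol order for equal counts.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries
```

Given symbols main at [0x100, 0x140) and foo at [0x140, 0x200), and samples {0x100: 5, 0x13c: 25, 0x150: 70}, the result is foo with 70 samples followed by main with 30; a threshold of 50 keeps only foo.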
Extended Feature Interface

Introduction

The Extended Feature Interface is a standard callback interface designed to allow extension to the OProfile daemon's sample processing. Each feature defines a set of callback handlers which can be enabled or disabled through an OProfile daemon command-line option. This interface can be used to implement support for architecture-specific features or features not commonly used by general OProfile users.

Feature Name and Handlers

Each extended feature has an entry in the ext_feature_table in opd_extended.cpp. Each entry contains a feature name, and a corresponding set of handlers. The feature name is a unique string, which is used to identify a feature in the table. Each feature provides a set of handlers, which will be executed by the OProfile daemon from pre-determined locations to perform certain tasks. At runtime, the OProfile daemon calls a feature handler wrapper from one of the predetermined locations to check whether an extended feature is enabled, and whether a particular handler exists. Only the handlers of the enabled feature will be executed.

Enabling Features

Each feature is enabled using the OProfile daemon (oprofiled) command-line option "--ext-feature=<extended-feature-name>:[args]". The "extended-feature-name" is used to determine the feature to be enabled. The optional "args" is passed into the feature-specific initialization handler (ext_init). Currently, only one extended feature can be enabled at a time.

Type of Handlers

Each feature is responsible for providing its own set of handlers. Types of handler are:

ext_init Handler

"ext_init" handles initialization of an extended feature. It takes the "args" parameter, which is passed in through "oprofiled --ext-feature=<extended-feature-name>:[args]". This handler is executed in the function opd_options() in the file daemon/oprofiled.c. The ext_init handler is required for all features.
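As a rough illustration of the table lookup and ext_init call described above, here is a minimal sketch. It does not reproduce the daemon's code: the ext_handlers and ext_feature layouts, the "demo" feature, feature_table, enable_feature() and call_print_stats() are all hypothetical names chosen for this example; only the "<name>:[args]" option shape and the handler names ext_init and ext_print_stats come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical handler set; the real structure in the daemon may differ.
struct ext_handlers {
    int (*ext_init)(const char* args);   // required for every feature
    int (*ext_print_stats)();            // optional, may be null
};

// Hypothetical table entry: a unique name plus its handlers.
struct ext_feature {
    const char* name;
    const ext_handlers* handlers;
};

// A made-up feature that only records the arguments it was given.
static std::string last_args;
static int demo_init(const char* args) { last_args = args ? args : ""; return 0; }
static const ext_handlers demo_handlers = { demo_init, nullptr };

static const ext_feature feature_table[] = { { "demo", &demo_handlers } };

// At most one feature is enabled at a time.
static const ext_feature* enabled_feature = nullptr;

// Parse "<name>:[args]", find the feature by name and run its ext_init.
// Returns -1 if the feature is unknown or has no ext_init handler.
int enable_feature(const char* option)
{
    assert(option != nullptr);
    const char* colon = std::strchr(option, ':');
    const std::size_t name_len =
        colon ? static_cast<std::size_t>(colon - option) : std::strlen(option);
    const char* args = colon ? colon + 1 : "";

    for (const auto& f : feature_table) {
        if (std::strlen(f.name) == name_len &&
            std::strncmp(f.name, option, name_len) == 0) {
            if (!f.handlers || !f.handlers->ext_init)
                return -1;
            enabled_feature = &f;
            return f.handlers->ext_init(args);
        }
    }
    return -1;
}

// Wrapper called from a fixed location: it does nothing unless a feature is
// enabled and that feature actually provides the handler.
int call_print_stats()
{
    if (!enabled_feature || !enabled_feature->handlers->ext_print_stats)
        return 0;
    return enabled_feature->handlers->ext_print_stats();
}
```

The wrapper pattern keeps the call sites free of feature-specific checks: each site calls one wrapper, and the wrapper decides whether anything runs.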
ext_print_stats Handler

"ext_print_stats" handles the extended feature statistics report. It adds a new section to the OProfile daemon statistics report, which is normally written to the file /var/lib/oprofile/samples/oprofiled.log. This handler is executed in the function opd_print_stats() in the file daemon/opd_stats.c.

ext_sfile Handler

"ext_sfile" contains a set of handlers related to operations on the extended sample files (sample files for events related to the extended feature). These operations include create_sfile(), sfile_dup(), close_sfile(), sync_sfile(), and get_file() as defined in daemon/opd_sfile.c. An additional field, odb_t * ext_file, is added to the struct sfile for storing extended sample file information.

Extended Feature Reference Implementation

Instruction-Based Sampling (IBS)

An example of an extended feature implementation can be seen by examining the AMD Instruction-Based Sampling support.

IBS Initialization

Instruction-Based Sampling (IBS) is a new performance measurement technique available on AMD Family 10h processors. Enabling IBS profiling is done simply by specifying IBS performance events through the "--event=" options.

    opcontrol --event=IBS_FETCH_XXX:<count>:<um>:<kernel>:<user>
    opcontrol --event=IBS_OP_XXX:<count>:<um>:<kernel>:<user>

Note: the count and unitmask must be the same for all IBS fetch events, and likewise for all IBS op events.

IBS performance events are listed in opcontrol --list-events. When users specify these events, opcontrol verifies them using ophelp, which checks for the ext:ibs_fetch or ext:ibs_op tag in the events/x86-64/family10/events file. Then, it configures the driver interface (/dev/oprofile/ibs_fetch/... and /dev/oprofile/ibs_op/...) and starts the OProfile daemon as follows.
    oprofiled \
    --ext-feature=ibs:\
    fetch:<IBS_FETCH_EVENT1>,<IBS_FETCH_EVENT2>,...,:<IBS fetch count>:<IBS Fetch um>|\
    op:<IBS_OP_EVENT1>,<IBS_OP_EVENT2>,...,:<IBS op count>:<IBS op um>

Here, the OProfile daemon parses the --ext-feature option and checks the feature name ("ibs") before calling the initialization function to handle the string containing IBS events, counts, and unitmasks. Then, it stores each event in the IBS virtual-counter table (struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]) and stores the event index in the IBS Virtual Counter Index (VCI) map (ibs_vci_map[OP_MAX_IBS_COUNTERS]) with the IBS event value as the map key.

IBS Data Processing

During a profile session, the OProfile daemon identifies IBS samples in the event buffer using the "IBS_FETCH_CODE" or "IBS_OP_CODE". These codes trigger the handlers code_ibs_fetch_sample() or code_ibs_op_sample() listed in the handler_t handlers[] vector in daemon/opd_trans.c. These handlers are responsible for processing IBS samples and translating them into IBS performance events.

Unlike traditional performance events, multiple IBS performance events can be derived from each IBS sample. For each event that the user specifies, a combination of bits from Model-Specific Registers (MSR) is checked against the bitmask defining the event. If the condition is met, the event is recorded. The derivation logic is in the files daemon/opd_ibs_macro.h and daemon/opd_ibs_trans.[h,c].

IBS Sample File

Traditionally, sample file information (odb_t) is stored in the struct sfile::odb_t file[OP_MAX_COUNTER]. Currently, OP_MAX_COUNTER is 8 on non-alpha systems, and 20 on alpha-based systems. The event index (the counter number on which the event is configured) is used to access the corresponding entry in the array. Unlike traditional performance events, IBS does not use the actual counter registers (i.e. /dev/oprofile/0,1,2,3).
Also, the number of performance events generated by IBS could be larger than OP_MAX_COUNTER (currently up to 13 IBS-fetch and 46 IBS-op events). Therefore IBS requires a special data structure and sfile handlers (struct opd_ext_sfile_handlers) for managing IBS sample files. IBS sample file information is stored in memory allocated by the handler ibs_sfile_create(), which can be accessed through struct sfile::odb_t * ext_files.

Glossary of OProfile source concepts and types

application image
The primary binary image used by an application. This is derived from the kernel and corresponds to the binary started upon running an application: for example, /bin/bash.

binary image
An ELF file containing executable code: this includes kernel modules, the kernel itself (a.k.a. vmlinux), shared libraries, and application binaries.

dcookie
Short for "dentry cookie". A unique ID that can be looked up to provide the full path name of a binary image.

dependent image
A binary image that is dependent upon an application, used with per-application separation. Most commonly, shared libraries. For example, if /bin/bash is running and we take some samples inside the C library itself due to bash calling library code, then the image /lib/libc.so would be dependent upon /bin/bash.

merging
This refers to the ability to merge several distinct sample files into one set of data at runtime, in the post-profiling tools. For example, per-thread sample files can be merged into one set of data, because they are compatible (i.e. the aggregation of the data is meaningful), but it's not possible to merge sample files for two different events, because there would be no useful meaning to the results.

profile class
A collection of profile data that has been collected under the same class template. For example, if we're using opreport to show results after profiling with two performance counters enabled, profiling DATA_MEM_REFS and CPU_CLK_UNHALTED, there would be two profile classes, one for each event.
Or if we're on an SMP system and doing per-cpu profiling, and we request opreport to show results for each CPU side-by-side, there would be a profile class for each CPU.

profile specification
The parameters the user passes to the post-profiling tools that limit what sample files are used. This specification is matched against the available sample files to generate a selection of profile data.

profile template
The parameters that define what goes in a particular profile class. This includes a symbolic name (e.g. "cpu:1") and the code-usable equivalent.