• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1========================
2ftrace - Function Tracer
3========================
4
5Copyright 2008 Red Hat Inc.
6
7:Author:   Steven Rostedt <srostedt@redhat.com>
8:License:  The GNU Free Documentation License, Version 1.2
9          (dual licensed under the GPL v2)
10:Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
11		      John Kacur, and David Teigland.
12
13- Written for: 2.6.28-rc2
14- Updated for: 3.10
15- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
16- Converted to rst format - Changbin Du <changbin.du@intel.com>
17
18Introduction
19------------
20
21Ftrace is an internal tracer designed to help out developers and
22designers of systems to find what is going on inside the kernel.
23It can be used for debugging or analyzing latencies and
24performance issues that take place outside of user-space.
25
26Although ftrace is typically considered the function tracer, it
27is really a framework of several assorted tracing utilities.
28There's latency tracing to examine what occurs between interrupts
29disabled and enabled, as well as for preemption and from a time
30a task is woken to the task is actually scheduled in.
31
32One of the most common uses of ftrace is the event tracing.
33Throughout the kernel is hundreds of static event points that
34can be enabled via the tracefs file system to see what is
35going on in certain parts of the kernel.
36
37See events.rst for more information.
38
39
40Implementation Details
41----------------------
42
43See :doc:`ftrace-design` for details for arch porters and such.
44
45
46The File System
47---------------
48
49Ftrace uses the tracefs file system to hold the control files as
50well as the files to display output.
51
52When tracefs is configured into the kernel (which selecting any ftrace
53option will do) the directory /sys/kernel/tracing will be created. To mount
54this directory, you can add to your /etc/fstab file::
55
56 tracefs       /sys/kernel/tracing       tracefs defaults        0       0
57
58Or you can mount it at run time with::
59
60 mount -t tracefs nodev /sys/kernel/tracing
61
62For quicker access to that directory you may want to make a soft link to
63it::
64
65 ln -s /sys/kernel/tracing /tracing
66
67.. attention::
68
69  Before 4.1, all ftrace tracing control files were within the debugfs
70  file system, which is typically located at /sys/kernel/debug/tracing.
71  For backward compatibility, when mounting the debugfs file system,
72  the tracefs file system will be automatically mounted at:
73
74  /sys/kernel/debug/tracing
75
76  All files located in the tracefs file system will be located in that
77  debugfs file system directory as well.
78
79  In order to not automount tracefs in the debugfs filesystem, enable the
80  defconfig option CONFIG_TRACEFS_DISABLE_AUTOMOUNT.
81
82.. attention::
83
84  Any selected ftrace option will also create the tracefs file system.
85  The rest of the document will assume that you are in the ftrace directory
86  (cd /sys/kernel/tracing) and will only concentrate on the files within that
87  directory and not distract from the content with the extended
88  "/sys/kernel/tracing" path name.
89
90That's it! (assuming that you have ftrace configured into your kernel)
91
92After mounting tracefs you will have access to the control and output files
93of ftrace. Here is a list of some of the key files:
94
95
96 Note: all time values are in microseconds.
97
98  current_tracer:
99
100	This is used to set or display the current tracer
101	that is configured. Changing the current tracer clears
102	the ring buffer content as well as the "snapshot" buffer.
103
104  available_tracers:
105
106	This holds the different types of tracers that
107	have been compiled into the kernel. The
108	tracers listed here can be configured by
109	echoing their name into current_tracer.
110
111  tracing_on:
112
113	This sets or displays whether writing to the trace
114	ring buffer is enabled. Echo 0 into this file to disable
115	the tracer or 1 to enable it. Note, this only disables
116	writing to the ring buffer, the tracing overhead may
117	still be occurring.
118
119	The kernel function tracing_off() can be used within the
120	kernel to disable writing to the ring buffer, which will
121	set this file to "0". User space can re-enable tracing by
122	echoing "1" into the file.
123
124	Note, the function and event trigger "traceoff" will also
125	set this file to zero and stop tracing. Which can also
126	be re-enabled by user space using this file.
127
128  trace:
129
130	This file holds the output of the trace in a human
131	readable format (described below). Opening this file for
132	writing with the O_TRUNC flag clears the ring buffer content.
133        Note, this file is not a consumer. If tracing is off
134        (no tracer running, or tracing_on is zero), it will produce
135        the same output each time it is read. When tracing is on,
136        it may produce inconsistent results as it tries to read
137        the entire buffer without consuming it.
138
139  trace_pipe:
140
141	The output is the same as the "trace" file but this
142	file is meant to be streamed with live tracing.
143	Reads from this file will block until new data is
144	retrieved.  Unlike the "trace" file, this file is a
145	consumer. This means reading from this file causes
146	sequential reads to display more current data. Once
147	data is read from this file, it is consumed, and
148	will not be read again with a sequential read. The
149	"trace" file is static, and if the tracer is not
150	adding more data, it will display the same
151	information every time it is read.
152
153  trace_options:
154
155	This file lets the user control the amount of data
156	that is displayed in one of the above output
157	files. Options also exist to modify how a tracer
158	or events work (stack traces, timestamps, etc).
159
160  options:
161
162	This is a directory that has a file for every available
163	trace option (also in trace_options). Options may also be set
164	or cleared by writing a "1" or "0" respectively into the
165	corresponding file with the option name.
166
167  tracing_max_latency:
168
169	Some of the tracers record the max latency.
170	For example, the maximum time that interrupts are disabled.
171	The maximum time is saved in this file. The max trace will also be
172	stored,	and displayed by "trace". A new max trace will only be
173	recorded if the latency is greater than the value in this file
174	(in microseconds).
175
176	By echoing in a time into this file, no latency will be recorded
177	unless it is greater than the time in this file.
178
179  tracing_thresh:
180
181	Some latency tracers will record a trace whenever the
182	latency is greater than the number in this file.
183	Only active when the file contains a number greater than 0.
184	(in microseconds)
185
186  buffer_size_kb:
187
188	This sets or displays the number of kilobytes each CPU
189	buffer holds. By default, the trace buffers are the same size
190	for each CPU. The displayed number is the size of the
191	CPU buffer and not total size of all buffers. The
192	trace buffers are allocated in pages (blocks of memory
193	that the kernel uses for allocation, usually 4 KB in size).
194	A few extra pages may be allocated to accommodate buffer management
195	meta-data. If the last page allocated has room for more bytes
196	than requested, the rest of the page will be used,
197	making the actual allocation bigger than requested or shown.
198	( Note, the size may not be a multiple of the page size
199	due to buffer management meta-data. )
200
201	Buffer sizes for individual CPUs may vary
202	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
203	this file will show "X".
204
205  buffer_total_size_kb:
206
207	This displays the total combined size of all the trace buffers.
208
209  free_buffer:
210
211	If a process is performing tracing, and the ring buffer	should be
212	shrunk "freed" when the process is finished, even if it were to be
213	killed by a signal, this file can be used for that purpose. On close
214	of this file, the ring buffer will be resized to its minimum size.
215	Having a process that is tracing also open this file, when the process
216	exits its file descriptor for this file will be closed, and in doing so,
217	the ring buffer will be "freed".
218
219	It may also stop tracing if disable_on_free option is set.
220
221  tracing_cpumask:
222
223	This is a mask that lets the user only trace on specified CPUs.
224	The format is a hex string representing the CPUs.
225
226  set_ftrace_filter:
227
228	When dynamic ftrace is configured in (see the
229	section below "dynamic ftrace"), the code is dynamically
230	modified (code text rewrite) to disable calling of the
231	function profiler (mcount). This lets tracing be configured
232	in with practically no overhead in performance.  This also
233	has a side effect of enabling or disabling specific functions
234	to be traced. Echoing names of functions into this file
235	will limit the trace to only those functions.
236	This influences the tracers "function" and "function_graph"
237	and thus also function profiling (see "function_profile_enabled").
238
239	The functions listed in "available_filter_functions" are what
240	can be written into this file.
241
242	This interface also allows for commands to be used. See the
243	"Filter commands" section for more details.
244
245	As a speed up, since processing strings can be quite expensive
246	and requires a check of all functions registered to tracing, instead
247	an index can be written into this file. A number (starting with "1")
248	written will instead select the same corresponding at the line position
249	of the "available_filter_functions" file.
250
251  set_ftrace_notrace:
252
253	This has an effect opposite to that of
254	set_ftrace_filter. Any function that is added here will not
255	be traced. If a function exists in both set_ftrace_filter
256	and set_ftrace_notrace,	the function will _not_ be traced.
257
258  set_ftrace_pid:
259
260	Have the function tracer only trace the threads whose PID are
261	listed in this file.
262
263	If the "function-fork" option is set, then when a task whose
264	PID is listed in this file forks, the child's PID will
265	automatically be added to this file, and the child will be
266	traced by the function tracer as well. This option will also
267	cause PIDs of tasks that exit to be removed from the file.
268
269  set_ftrace_notrace_pid:
270
271        Have the function tracer ignore threads whose PID are listed in
272        this file.
273
274        If the "function-fork" option is set, then when a task whose
275	PID is listed in this file forks, the child's PID will
276	automatically be added to this file, and the child will not be
277	traced by the function tracer as well. This option will also
278	cause PIDs of tasks that exit to be removed from the file.
279
280        If a PID is in both this file and "set_ftrace_pid", then this
281        file takes precedence, and the thread will not be traced.
282
283  set_event_pid:
284
285	Have the events only trace a task with a PID listed in this file.
286	Note, sched_switch and sched_wake_up will also trace events
287	listed in this file.
288
289	To have the PIDs of children of tasks with their PID in this file
290	added on fork, enable the "event-fork" option. That option will also
291	cause the PIDs of tasks to be removed from this file when the task
292	exits.
293
294  set_event_notrace_pid:
295
296	Have the events not trace a task with a PID listed in this file.
297	Note, sched_switch and sched_wakeup will trace threads not listed
298	in this file, even if a thread's PID is in the file if the
299        sched_switch or sched_wakeup events also trace a thread that should
300        be traced.
301
302	To have the PIDs of children of tasks with their PID in this file
303	added on fork, enable the "event-fork" option. That option will also
304	cause the PIDs of tasks to be removed from this file when the task
305	exits.
306
307  set_graph_function:
308
309	Functions listed in this file will cause the function graph
310	tracer to only trace these functions and the functions that
311	they call. (See the section "dynamic ftrace" for more details).
312	Note, set_ftrace_filter and set_ftrace_notrace still affects
313	what functions are being traced.
314
315  set_graph_notrace:
316
317	Similar to set_graph_function, but will disable function graph
318	tracing when the function is hit until it exits the function.
319	This makes it possible to ignore tracing functions that are called
320	by a specific function.
321
322  available_filter_functions:
323
324	This lists the functions that ftrace has processed and can trace.
325	These are the function names that you can pass to
326	"set_ftrace_filter", "set_ftrace_notrace",
327	"set_graph_function", or "set_graph_notrace".
328	(See the section "dynamic ftrace" below for more details.)
329
330  dyn_ftrace_total_info:
331
332	This file is for debugging purposes. The number of functions that
333	have been converted to nops and are available to be traced.
334
335  enabled_functions:
336
337	This file is more for debugging ftrace, but can also be useful
338	in seeing if any function has a callback attached to it.
339	Not only does the trace infrastructure use ftrace function
340	trace utility, but other subsystems might too. This file
341	displays all functions that have a callback attached to them
342	as well as the number of callbacks that have been attached.
343	Note, a callback may also call multiple functions which will
344	not be listed in this count.
345
346	If the callback registered to be traced by a function with
347	the "save regs" attribute (thus even more overhead), a 'R'
348	will be displayed on the same line as the function that
349	is returning registers.
350
351	If the callback registered to be traced by a function with
352	the "ip modify" attribute (thus the regs->ip can be changed),
353	an 'I' will be displayed on the same line as the function that
354	can be overridden.
355
356	If the architecture supports it, it will also show what callback
357	is being directly called by the function. If the count is greater
358	than 1 it most likely will be ftrace_ops_list_func().
359
360	If the callback of the function jumps to a trampoline that is
361	specific to a the callback and not the standard trampoline,
362	its address will be printed as well as the function that the
363	trampoline calls.
364
365  function_profile_enabled:
366
367	When set it will enable all functions with either the function
368	tracer, or if configured, the function graph tracer. It will
369	keep a histogram of the number of functions that were called
370	and if the function graph tracer was configured, it will also keep
371	track of the time spent in those functions. The histogram
372	content can be displayed in the files:
373
374	trace_stat/function<cpu> ( function0, function1, etc).
375
376  trace_stat:
377
378	A directory that holds different tracing stats.
379
380  kprobe_events:
381
382	Enable dynamic trace points. See kprobetrace.rst.
383
384  kprobe_profile:
385
386	Dynamic trace points stats. See kprobetrace.rst.
387
388  max_graph_depth:
389
390	Used with the function graph tracer. This is the max depth
391	it will trace into a function. Setting this to a value of
392	one will show only the first kernel function that is called
393	from user space.
394
395  printk_formats:
396
397	This is for tools that read the raw format files. If an event in
398	the ring buffer references a string, only a pointer to the string
399	is recorded into the buffer and not the string itself. This prevents
400	tools from knowing what that string was. This file displays the string
401	and address for	the string allowing tools to map the pointers to what
402	the strings were.
403
404  saved_cmdlines:
405
406	Only the pid of the task is recorded in a trace event unless
407	the event specifically saves the task comm as well. Ftrace
408	makes a cache of pid mappings to comms to try to display
409	comms for events. If a pid for a comm is not listed, then
410	"<...>" is displayed in the output.
411
412	If the option "record-cmd" is set to "0", then comms of tasks
413	will not be saved during recording. By default, it is enabled.
414
415  saved_cmdlines_size:
416
417	By default, 128 comms are saved (see "saved_cmdlines" above). To
418	increase or decrease the amount of comms that are cached, echo
419	the number of comms to cache into this file.
420
421  saved_tgids:
422
423	If the option "record-tgid" is set, on each scheduling context switch
424	the Task Group ID of a task is saved in a table mapping the PID of
425	the thread to its TGID. By default, the "record-tgid" option is
426	disabled.
427
428  snapshot:
429
430	This displays the "snapshot" buffer and also lets the user
431	take a snapshot of the current running trace.
432	See the "Snapshot" section below for more details.
433
434  stack_max_size:
435
436	When the stack tracer is activated, this will display the
437	maximum stack size it has encountered.
438	See the "Stack Trace" section below.
439
440  stack_trace:
441
442	This displays the stack back trace of the largest stack
443	that was encountered when the stack tracer is activated.
444	See the "Stack Trace" section below.
445
446  stack_trace_filter:
447
448	This is similar to "set_ftrace_filter" but it limits what
449	functions the stack tracer will check.
450
451  trace_clock:
452
453	Whenever an event is recorded into the ring buffer, a
454	"timestamp" is added. This stamp comes from a specified
455	clock. By default, ftrace uses the "local" clock. This
456	clock is very fast and strictly per cpu, but on some
457	systems it may not be monotonic with respect to other
458	CPUs. In other words, the local clocks may not be in sync
459	with local clocks on other CPUs.
460
461	Usual clocks for tracing::
462
463	  # cat trace_clock
464	  [local] global counter x86-tsc
465
466	The clock with the square brackets around it is the one in effect.
467
468	local:
469		Default clock, but may not be in sync across CPUs
470
471	global:
472		This clock is in sync with all CPUs but may
473		be a bit slower than the local clock.
474
475	counter:
476		This is not a clock at all, but literally an atomic
477		counter. It counts up one by one, but is in sync
478		with all CPUs. This is useful when you need to
479		know exactly the order events occurred with respect to
480		each other on different CPUs.
481
482	uptime:
483		This uses the jiffies counter and the time stamp
484		is relative to the time since boot up.
485
486	perf:
487		This makes ftrace use the same clock that perf uses.
488		Eventually perf will be able to read ftrace buffers
489		and this will help out in interleaving the data.
490
491	x86-tsc:
492		Architectures may define their own clocks. For
493		example, x86 uses its own TSC cycle clock here.
494
495	ppc-tb:
496		This uses the powerpc timebase register value.
497		This is in sync across CPUs and can also be used
498		to correlate events across hypervisor/guest if
499		tb_offset is known.
500
501	mono:
502		This uses the fast monotonic clock (CLOCK_MONOTONIC)
503		which is monotonic and is subject to NTP rate adjustments.
504
505	mono_raw:
506		This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
507		which is monotonic but is not subject to any rate adjustments
508		and ticks at the same rate as the hardware clocksource.
509
510	boot:
511		This is the boot clock (CLOCK_BOOTTIME) and is based on the
512		fast monotonic clock, but also accounts for time spent in
513		suspend. Since the clock access is designed for use in
514		tracing in the suspend path, some side effects are possible
515		if clock is accessed after the suspend time is accounted before
516		the fast mono clock is updated. In this case, the clock update
517		appears to happen slightly sooner than it normally would have.
518		Also on 32-bit systems, it's possible that the 64-bit boot offset
519		sees a partial update. These effects are rare and post
520		processing should be able to handle them. See comments in the
521		ktime_get_boot_fast_ns() function for more information.
522
523	To set a clock, simply echo the clock name into this file::
524
525	  # echo global > trace_clock
526
527	Setting a clock clears the ring buffer content as well as the
528	"snapshot" buffer.
529
530  trace_marker:
531
532	This is a very useful file for synchronizing user space
533	with events happening in the kernel. Writing strings into
534	this file will be written into the ftrace buffer.
535
536	It is useful in applications to open this file at the start
537	of the application and just reference the file descriptor
538	for the file::
539
540		void trace_write(const char *fmt, ...)
541		{
542			va_list ap;
543			char buf[256];
544			int n;
545
546			if (trace_fd < 0)
547				return;
548
549			va_start(ap, fmt);
550			n = vsnprintf(buf, 256, fmt, ap);
551			va_end(ap);
552
553			write(trace_fd, buf, n);
554		}
555
556	start::
557
558		trace_fd = open("trace_marker", WR_ONLY);
559
560	Note: Writing into the trace_marker file can also initiate triggers
561	      that are written into /sys/kernel/tracing/events/ftrace/print/trigger
562	      See "Event triggers" in Documentation/trace/events.rst and an
563              example in Documentation/trace/histogram.rst (Section 3.)
564
565  trace_marker_raw:
566
567	This is similar to trace_marker above, but is meant for binary data
568	to be written to it, where a tool can be used to parse the data
569	from trace_pipe_raw.
570
571  uprobe_events:
572
573	Add dynamic tracepoints in programs.
574	See uprobetracer.rst
575
576  uprobe_profile:
577
578	Uprobe statistics. See uprobetrace.txt
579
580  instances:
581
582	This is a way to make multiple trace buffers where different
583	events can be recorded in different buffers.
584	See "Instances" section below.
585
586  events:
587
588	This is the trace event directory. It holds event tracepoints
589	(also known as static tracepoints) that have been compiled
590	into the kernel. It shows what event tracepoints exist
591	and how they are grouped by system. There are "enable"
592	files at various levels that can enable the tracepoints
593	when a "1" is written to them.
594
595	See events.rst for more information.
596
597  set_event:
598
599	By echoing in the event into this file, will enable that event.
600
601	See events.rst for more information.
602
603  available_events:
604
605	A list of events that can be enabled in tracing.
606
607	See events.rst for more information.
608
609  timestamp_mode:
610
611	Certain tracers may change the timestamp mode used when
612	logging trace events into the event buffer.  Events with
613	different modes can coexist within a buffer but the mode in
614	effect when an event is logged determines which timestamp mode
615	is used for that event.  The default timestamp mode is
616	'delta'.
617
618	Usual timestamp modes for tracing:
619
620	  # cat timestamp_mode
621	  [delta] absolute
622
623	  The timestamp mode with the square brackets around it is the
624	  one in effect.
625
626	  delta: Default timestamp mode - timestamp is a delta against
627	         a per-buffer timestamp.
628
629	  absolute: The timestamp is a full timestamp, not a delta
630                 against some other value.  As such it takes up more
631                 space and is less efficient.
632
633  hwlat_detector:
634
635	Directory for the Hardware Latency Detector.
636	See "Hardware Latency Detector" section below.
637
638  per_cpu:
639
640	This is a directory that contains the trace per_cpu information.
641
642  per_cpu/cpu0/buffer_size_kb:
643
644	The ftrace buffer is defined per_cpu. That is, there's a separate
645	buffer for each CPU to allow writes to be done atomically,
646	and free from cache bouncing. These buffers may have different
647	size buffers. This file is similar to the buffer_size_kb
648	file, but it only displays or sets the buffer size for the
649	specific CPU. (here cpu0).
650
651  per_cpu/cpu0/trace:
652
653	This is similar to the "trace" file, but it will only display
654	the data specific for the CPU. If written to, it only clears
655	the specific CPU buffer.
656
657  per_cpu/cpu0/trace_pipe
658
659	This is similar to the "trace_pipe" file, and is a consuming
660	read, but it will only display (and consume) the data specific
661	for the CPU.
662
663  per_cpu/cpu0/trace_pipe_raw
664
665	For tools that can parse the ftrace ring buffer binary format,
666	the trace_pipe_raw file can be used to extract the data
667	from the ring buffer directly. With the use of the splice()
668	system call, the buffer data can be quickly transferred to
669	a file or to the network where a server is collecting the
670	data.
671
672	Like trace_pipe, this is a consuming reader, where multiple
673	reads will always produce different data.
674
675  per_cpu/cpu0/snapshot:
676
677	This is similar to the main "snapshot" file, but will only
678	snapshot the current CPU (if supported). It only displays
679	the content of the snapshot for a given CPU, and if
680	written to, only clears this CPU buffer.
681
682  per_cpu/cpu0/snapshot_raw:
683
684	Similar to the trace_pipe_raw, but will read the binary format
685	from the snapshot buffer for the given CPU.
686
687  per_cpu/cpu0/stats:
688
689	This displays certain stats about the ring buffer:
690
691	entries:
692		The number of events that are still in the buffer.
693
694	overrun:
695		The number of lost events due to overwriting when
696		the buffer was full.
697
698	commit overrun:
699		Should always be zero.
700		This gets set if so many events happened within a nested
701		event (ring buffer is re-entrant), that it fills the
702		buffer and starts dropping events.
703
704	bytes:
705		Bytes actually read (not overwritten).
706
707	oldest event ts:
708		The oldest timestamp in the buffer
709
710	now ts:
711		The current timestamp
712
713	dropped events:
714		Events lost due to overwrite option being off.
715
716	read events:
717		The number of events read.
718
719The Tracers
720-----------
721
722Here is the list of current tracers that may be configured.
723
724  "function"
725
726	Function call tracer to trace all kernel functions.
727
728  "function_graph"
729
730	Similar to the function tracer except that the
731	function tracer probes the functions on their entry
732	whereas the function graph tracer traces on both entry
733	and exit of the functions. It then provides the ability
734	to draw a graph of function calls similar to C code
735	source.
736
737  "blk"
738
739	The block tracer. The tracer used by the blktrace user
740	application.
741
742  "hwlat"
743
744	The Hardware Latency tracer is used to detect if the hardware
745	produces any latency. See "Hardware Latency Detector" section
746	below.
747
748  "irqsoff"
749
750	Traces the areas that disable interrupts and saves
751	the trace with the longest max latency.
752	See tracing_max_latency. When a new max is recorded,
753	it replaces the old trace. It is best to view this
754	trace with the latency-format option enabled, which
755	happens automatically when the tracer is selected.
756
757  "preemptoff"
758
759	Similar to irqsoff but traces and records the amount of
760	time for which preemption is disabled.
761
762  "preemptirqsoff"
763
764	Similar to irqsoff and preemptoff, but traces and
765	records the largest time for which irqs and/or preemption
766	is disabled.
767
768  "wakeup"
769
770	Traces and records the max latency that it takes for
771	the highest priority task to get scheduled after
772	it has been woken up.
773        Traces all tasks as an average developer would expect.
774
775  "wakeup_rt"
776
777        Traces and records the max latency that it takes for just
778        RT tasks (as the current "wakeup" does). This is useful
779        for those interested in wake up timings of RT tasks.
780
781  "wakeup_dl"
782
783	Traces and records the max latency that it takes for
784	a SCHED_DEADLINE task to be woken (as the "wakeup" and
785	"wakeup_rt" does).
786
787  "mmiotrace"
788
789	A special tracer that is used to trace binary module.
790	It will trace all the calls that a module makes to the
791	hardware. Everything it writes and reads from the I/O
792	as well.
793
794  "branch"
795
796	This tracer can be configured when tracing likely/unlikely
797	calls within the kernel. It will trace when a likely and
798	unlikely branch is hit and if it was correct in its prediction
799	of being correct.
800
801  "nop"
802
803	This is the "trace nothing" tracer. To remove all
804	tracers from tracing simply echo "nop" into
805	current_tracer.
806
807Error conditions
808----------------
809
810  For most ftrace commands, failure modes are obvious and communicated
811  using standard return codes.
812
813  For other more involved commands, extended error information may be
814  available via the tracing/error_log file.  For the commands that
815  support it, reading the tracing/error_log file after an error will
816  display more detailed information about what went wrong, if
817  information is available.  The tracing/error_log file is a circular
818  error log displaying a small number (currently, 8) of ftrace errors
819  for the last (8) failed commands.
820
821  The extended error information and usage takes the form shown in
822  this example::
823
824    # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
825    echo: write error: Invalid argument
826
827    # cat /sys/kernel/debug/tracing/error_log
828    [ 5348.887237] location: error: Couldn't yyy: zzz
829      Command: xxx
830               ^
831    [ 7517.023364] location: error: Bad rrr: sss
832      Command: ppp qqq
833                   ^
834
835  To clear the error log, echo the empty string into it::
836
837    # echo > /sys/kernel/debug/tracing/error_log
838
839Examples of using the tracer
840----------------------------
841
842Here are typical examples of using the tracers when controlling
843them only with the tracefs interface (without using any
844user-land utilities).
845
846Output format:
847--------------
848
849Here is an example of the output format of the file "trace"::
850
851  # tracer: function
852  #
853  # entries-in-buffer/entries-written: 140080/250280   #P:4
854  #
855  #                              _-----=> irqs-off
856  #                             / _----=> need-resched
857  #                            | / _---=> hardirq/softirq
858  #                            || / _--=> preempt-depth
859  #                            ||| /     delay
860  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
861  #              | |       |   ||||       |         |
862              bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
863              bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
864              bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
865              sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
866              bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
867              bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
868              bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
869              bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
870              bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
871              sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
872              ....
873
874A header is printed with the tracer name that is represented by
875the trace. In this case the tracer is "function". Then it shows the
876number of events in the buffer as well as the total number of entries
877that were written. The difference is the number of entries that were
878lost due to the buffer filling up (250280 - 140080 = 110200 events
879lost).
880
881The header explains the content of the events. Task name "bash", the task
882PID "1977", the CPU that it was running on "000", the latency format
883(explained below), the timestamp in <secs>.<usecs> format, the
884function name that was traced "sys_close" and the parent function that
885called this function "system_call_fastpath". The timestamp is the time
886at which the function was entered.
887
888Latency trace format
889--------------------
890
891When the latency-format option is enabled or when one of the latency
892tracers is set, the trace file gives somewhat more information to see
893why a latency happened. Here is a typical trace::
894
895  # tracer: irqsoff
896  #
897  # irqsoff latency trace v1.1.5 on 3.8.0-test+
898  # --------------------------------------------------------------------
899  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
900  #    -----------------
901  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
902  #    -----------------
903  #  => started at: __lock_task_sighand
904  #  => ended at:   _raw_spin_unlock_irqrestore
905  #
906  #
907  #                  _------=> CPU#
908  #                 / _-----=> irqs-off
909  #                | / _----=> need-resched
910  #                || / _---=> hardirq/softirq
911  #                ||| / _--=> preempt-depth
912  #                |||| /     delay
913  #  cmd     pid   ||||| time  |   caller
914  #     \   /      |||||  \    |   /
915        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
916        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
917        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
918        ps-6143    2d..1  306us : <stack trace>
919   => trace_hardirqs_on_caller
920   => trace_hardirqs_on
921   => _raw_spin_unlock_irqrestore
922   => do_task_stat
923   => proc_tgid_stat
924   => proc_single_show
925   => seq_read
926   => vfs_read
927   => sys_read
928   => system_call_fastpath
929
930
931This shows that the current tracer is "irqsoff" tracing the time
932for which interrupts were disabled. It gives the trace version (which
933never changes) and the version of the kernel upon which this was executed on
934(3.8). Then it displays the max latency in microseconds (259 us). The number
935of trace entries displayed and the total number (both are four: #4/4).
936VP, KP, SP, and HP are always zero and are reserved for later use.
937#P is the number of online CPUs (#P:4).
938
939The task is the process that was running when the latency
940occurred. (ps pid: 6143).
941
942The start and stop (the functions in which the interrupts were
943disabled and enabled respectively) that caused the latencies:
944
945  - __lock_task_sighand is where the interrupts were disabled.
946  - _raw_spin_unlock_irqrestore is where they were enabled again.
947
948The next lines after the header are the trace itself. The header
949explains which is which.
950
951  cmd: The name of the process in the trace.
952
953  pid: The PID of that process.
954
955  CPU#: The CPU which the process was running on.
956
957  irqs-off: 'd' interrupts are disabled. '.' otherwise.
958	.. caution:: If the architecture does not support a way to
959		read the irq flags variable, an 'X' will always
960		be printed here.
961
962  need-resched:
963	- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
964	- 'n' only TIF_NEED_RESCHED is set,
965	- 'p' only PREEMPT_NEED_RESCHED is set,
966	- '.' otherwise.
967
968  hardirq/softirq:
969	- 'Z' - NMI occurred inside a hardirq
970	- 'z' - NMI is running
971	- 'H' - hard irq occurred inside a softirq.
972	- 'h' - hard irq is running
973	- 's' - soft irq is running
974	- '.' - normal context.
975
976  preempt-depth: The level of preempt_disabled
977
978The above is mostly meaningful for kernel developers.
979
980  time:
981	When the latency-format option is enabled, the trace file
982	output includes a timestamp relative to the start of the
983	trace. This differs from the output when latency-format
984	is disabled, which includes an absolute timestamp.
985
986  delay:
987	This is just to help catch your eye a bit better. And
988	needs to be fixed to be only relative to the same CPU.
989	The marks are determined by the difference between this
990	current trace and the next trace.
991
992	  - '$' - greater than 1 second
993	  - '@' - greater than 100 millisecond
994	  - '*' - greater than 10 millisecond
995	  - '#' - greater than 1000 microsecond
996	  - '!' - greater than 100 microsecond
997	  - '+' - greater than 10 microsecond
998	  - ' ' - less than or equal to 10 microsecond.
999
1000  The rest is the same as the 'trace' file.
1001
1002  Note, the latency tracers will usually end with a back trace
1003  to easily find where the latency occurred.
1004
1005trace_options
1006-------------
1007
1008The trace_options file (or the options directory) is used to control
1009what gets printed in the trace output, or manipulate the tracers.
1010To see what is available, simply cat the file::
1011
1012  cat trace_options
1013	print-parent
1014	nosym-offset
1015	nosym-addr
1016	noverbose
1017	noraw
1018	nohex
1019	nobin
1020	noblock
1021	trace_printk
1022	annotate
1023	nouserstacktrace
1024	nosym-userobj
1025	noprintk-msg-only
1026	context-info
1027	nolatency-format
1028	record-cmd
1029	norecord-tgid
1030	overwrite
1031	nodisable_on_free
1032	irq-info
1033	markers
1034	noevent-fork
1035	function-trace
1036	nofunction-fork
1037	nodisplay-graph
1038	nostacktrace
1039	nobranch
1040
1041To disable one of the options, echo in the option prepended with
1042"no"::
1043
1044  echo noprint-parent > trace_options
1045
1046To enable an option, leave off the "no"::
1047
1048  echo sym-offset > trace_options
1049
1050Here are the available options:
1051
1052  print-parent
1053	On function traces, display the calling (parent)
1054	function as well as the function being traced.
1055	::
1056
1057	  print-parent:
1058	   bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
1059
1060	  noprint-parent:
1061	   bash-4000  [01]  1477.606694: simple_strtoul
1062
1063
1064  sym-offset
1065	Display not only the function name, but also the
1066	offset in the function. For example, instead of
1067	seeing just "ktime_get", you will see
1068	"ktime_get+0xb/0x20".
1069	::
1070
1071	  sym-offset:
1072	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
1073
1074  sym-addr
1075	This will also display the function address as well
1076	as the function name.
1077	::
1078
1079	  sym-addr:
1080	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
1081
1082  verbose
1083	This deals with the trace file when the
1084        latency-format option is enabled.
1085	::
1086
1087	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
1088	    (+0.000ms): simple_strtoul (kstrtoul)
1089
1090  raw
1091	This will display raw numbers. This option is best for
1092	use with user applications that can translate the raw
1093	numbers better than having it done in the kernel.
1094
1095  hex
1096	Similar to raw, but the numbers will be in a hexadecimal format.
1097
1098  bin
1099	This will print out the formats in raw binary.
1100
1101  block
1102	When set, reading trace_pipe will not block when polled.
1103
1104  trace_printk
1105	Can disable trace_printk() from writing into the buffer.
1106
1107  annotate
1108	It is sometimes confusing when the CPU buffers are full
1109	and one CPU buffer had a lot of events recently, thus
1110	a shorter time frame, were another CPU may have only had
1111	a few events, which lets it have older events. When
1112	the trace is reported, it shows the oldest events first,
1113	and it may look like only one CPU ran (the one with the
1114	oldest events). When the annotate option is set, it will
1115	display when a new CPU buffer started::
1116
1117			  <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
1118			  <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
1119			  <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
1120		##### CPU 2 buffer started ####
1121			  <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
1122			  <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
1123			  <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
1124
1125  userstacktrace
1126	This option changes the trace. It records a
1127	stacktrace of the current user space thread after
1128	each trace event.
1129
1130  sym-userobj
1131	when user stacktrace are enabled, look up which
1132	object the address belongs to, and print a
1133	relative address. This is especially useful when
1134	ASLR is on, otherwise you don't get a chance to
1135	resolve the address to object/file/line after
1136	the app is no longer running
1137
1138	The lookup is performed when you read
1139	trace,trace_pipe. Example::
1140
1141		  a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
1142		  x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
1143
1144
1145  printk-msg-only
1146	When set, trace_printk()s will only show the format
1147	and not their parameters (if trace_bprintk() or
1148	trace_bputs() was used to save the trace_printk()).
1149
1150  context-info
1151	Show only the event data. Hides the comm, PID,
1152	timestamp, CPU, and other useful data.
1153
1154  latency-format
1155	This option changes the trace output. When it is enabled,
1156	the trace displays additional information about the
1157	latency, as described in "Latency trace format".
1158
1159  pause-on-trace
1160	When set, opening the trace file for read, will pause
1161	writing to the ring buffer (as if tracing_on was set to zero).
1162	This simulates the original behavior of the trace file.
1163	When the file is closed, tracing will be enabled again.
1164
1165  record-cmd
1166	When any event or tracer is enabled, a hook is enabled
1167	in the sched_switch trace point to fill comm cache
1168	with mapped pids and comms. But this may cause some
1169	overhead, and if you only care about pids, and not the
1170	name of the task, disabling this option can lower the
1171	impact of tracing. See "saved_cmdlines".
1172
1173  record-tgid
1174	When any event or tracer is enabled, a hook is enabled
1175	in the sched_switch trace point to fill the cache of
1176	mapped Thread Group IDs (TGID) mapping to pids. See
1177	"saved_tgids".
1178
1179  overwrite
1180	This controls what happens when the trace buffer is
1181	full. If "1" (default), the oldest events are
1182	discarded and overwritten. If "0", then the newest
1183	events are discarded.
1184	(see per_cpu/cpu0/stats for overrun and dropped)
1185
1186  disable_on_free
1187	When the free_buffer is closed, tracing will
1188	stop (tracing_on set to 0).
1189
1190  irq-info
1191	Shows the interrupt, preempt count, need resched data.
1192	When disabled, the trace looks like::
1193
1194		# tracer: function
1195		#
1196		# entries-in-buffer/entries-written: 144405/9452052   #P:4
1197		#
1198		#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
1199		#              | |       |          |         |
1200			  <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1201			  <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
1202			  <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
1203
1204
1205  markers
1206	When set, the trace_marker is writable (only by root).
1207	When disabled, the trace_marker will error with EINVAL
1208	on write.
1209
1210  event-fork
1211	When set, tasks with PIDs listed in set_event_pid will have
1212	the PIDs of their children added to set_event_pid when those
1213	tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1214	their PIDs will be removed from the file.
1215
1216        This affects PIDs listed in set_event_notrace_pid as well.
1217
1218  function-trace
1219	The latency tracers will enable function tracing
1220	if this option is enabled (default it is). When
1221	it is disabled, the latency tracers do not trace
1222	functions. This keeps the overhead of the tracer down
1223	when performing latency tests.
1224
1225  function-fork
1226	When set, tasks with PIDs listed in set_ftrace_pid will
1227	have the PIDs of their children added to set_ftrace_pid
1228	when those tasks fork. Also, when tasks with PIDs in
1229	set_ftrace_pid exit, their PIDs will be removed from the
1230	file.
1231
1232        This affects PIDs in set_ftrace_notrace_pid as well.
1233
1234  display-graph
1235	When set, the latency tracers (irqsoff, wakeup, etc) will
1236	use function graph tracing instead of function tracing.
1237
1238  stacktrace
1239	When set, a stack trace is recorded after any trace event
1240	is recorded.
1241
1242  branch
1243	Enable branch tracing with the tracer. This enables branch
1244	tracer along with the currently set tracer. Enabling this
1245	with the "nop" tracer is the same as just enabling the
1246	"branch" tracer.
1247
1248.. tip:: Some tracers have their own options. They only appear in this
1249       file when the tracer is active. They always appear in the
1250       options directory.
1251
1252
1253Here are the per tracer options:
1254
1255Options for function tracer:
1256
1257  func_stack_trace
1258	When set, a stack trace is recorded after every
1259	function that is recorded. NOTE! Limit the functions
1260	that are recorded before enabling this, with
1261	"set_ftrace_filter" otherwise the system performance
1262	will be critically degraded. Remember to disable
1263	this option before clearing the function filter.
1264
1265Options for function_graph tracer:
1266
1267 Since the function_graph tracer has a slightly different output
1268 it has its own options to control what is displayed.
1269
1270  funcgraph-overrun
1271	When set, the "overrun" of the graph stack is
1272	displayed after each function traced. The
1273	overrun, is when the stack depth of the calls
1274	is greater than what is reserved for each task.
1275	Each task has a fixed array of functions to
1276	trace in the call graph. If the depth of the
1277	calls exceeds that, the function is not traced.
1278	The overrun is the number of functions missed
1279	due to exceeding this array.
1280
1281  funcgraph-cpu
1282	When set, the CPU number of the CPU where the trace
1283	occurred is displayed.
1284
1285  funcgraph-overhead
1286	When set, if the function takes longer than
1287	A certain amount, then a delay marker is
1288	displayed. See "delay" above, under the
1289	header description.
1290
1291  funcgraph-proc
1292	Unlike other tracers, the process' command line
1293	is not displayed by default, but instead only
1294	when a task is traced in and out during a context
1295	switch. Enabling this options has the command
1296	of each process displayed at every line.
1297
1298  funcgraph-duration
1299	At the end of each function (the return)
1300	the duration of the amount of time in the
1301	function is displayed in microseconds.
1302
1303  funcgraph-abstime
1304	When set, the timestamp is displayed at each line.
1305
1306  funcgraph-irqs
1307	When disabled, functions that happen inside an
1308	interrupt will not be traced.
1309
1310  funcgraph-tail
1311	When set, the return event will include the function
1312	that it represents. By default this is off, and
1313	only a closing curly bracket "}" is displayed for
1314	the return of a function.
1315
1316  sleep-time
1317	When running function graph tracer, to include
1318	the time a task schedules out in its function.
1319	When enabled, it will account time the task has been
1320	scheduled out as part of the function call.
1321
1322  graph-time
1323	When running function profiler with function graph tracer,
1324	to include the time to call nested functions. When this is
1325	not set, the time reported for the function will only
1326	include the time the function itself executed for, not the
1327	time for functions that it called.
1328
1329Options for blk tracer:
1330
1331  blk_classic
1332	Shows a more minimalistic output.
1333
1334
1335irqsoff
1336-------
1337
1338When interrupts are disabled, the CPU can not react to any other
1339external event (besides NMIs and SMIs). This prevents the timer
1340interrupt from triggering or the mouse interrupt from letting
1341the kernel know of a new mouse event. The result is a latency
1342with the reaction time.
1343
1344The irqsoff tracer tracks the time for which interrupts are
1345disabled. When a new maximum latency is hit, the tracer saves
1346the trace leading up to that latency point so that every time a
1347new maximum is reached, the old saved trace is discarded and the
1348new trace is saved.
1349
1350To reset the maximum, echo 0 into tracing_max_latency. Here is
1351an example::
1352
1353  # echo 0 > options/function-trace
1354  # echo irqsoff > current_tracer
1355  # echo 1 > tracing_on
1356  # echo 0 > tracing_max_latency
1357  # ls -ltr
1358  [...]
1359  # echo 0 > tracing_on
1360  # cat trace
1361  # tracer: irqsoff
1362  #
1363  # irqsoff latency trace v1.1.5 on 3.8.0-test+
1364  # --------------------------------------------------------------------
1365  # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1366  #    -----------------
1367  #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1368  #    -----------------
1369  #  => started at: run_timer_softirq
1370  #  => ended at:   run_timer_softirq
1371  #
1372  #
1373  #                  _------=> CPU#
1374  #                 / _-----=> irqs-off
1375  #                | / _----=> need-resched
1376  #                || / _---=> hardirq/softirq
1377  #                ||| / _--=> preempt-depth
1378  #                |||| /     delay
1379  #  cmd     pid   ||||| time  |   caller
1380  #     \   /      |||||  \    |   /
1381    <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
1382    <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
1383    <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
1384    <idle>-0       0dNs3   25us : <stack trace>
1385   => _raw_spin_unlock_irq
1386   => run_timer_softirq
1387   => __do_softirq
1388   => call_softirq
1389   => do_softirq
1390   => irq_exit
1391   => smp_apic_timer_interrupt
1392   => apic_timer_interrupt
1393   => rcu_idle_exit
1394   => cpu_idle
1395   => rest_init
1396   => start_kernel
1397   => x86_64_start_reservations
1398   => x86_64_start_kernel
1399
1400Here we see that we had a latency of 16 microseconds (which is
1401very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1402interrupts. The difference between the 16 and the displayed
1403timestamp 25us occurred because the clock was incremented
1404between the time of recording the max latency and the time of
1405recording the function that had that latency.
1406
1407Note the above example had function-trace not set. If we set
1408function-trace, we get a much larger output::
1409
1410 with echo 1 > options/function-trace
1411
1412  # tracer: irqsoff
1413  #
1414  # irqsoff latency trace v1.1.5 on 3.8.0-test+
1415  # --------------------------------------------------------------------
1416  # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1417  #    -----------------
1418  #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1419  #    -----------------
1420  #  => started at: ata_scsi_queuecmd
1421  #  => ended at:   ata_scsi_queuecmd
1422  #
1423  #
1424  #                  _------=> CPU#
1425  #                 / _-----=> irqs-off
1426  #                | / _----=> need-resched
1427  #                || / _---=> hardirq/softirq
1428  #                ||| / _--=> preempt-depth
1429  #                |||| /     delay
1430  #  cmd     pid   ||||| time  |   caller
1431  #     \   /      |||||  \    |   /
1432      bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1433      bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
1434      bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1435      bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1436      bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1437      bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
1438      bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
1439      bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1440      bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1441  [...]
1442      bash-2042    3d..1   67us : delay_tsc <-__delay
1443      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1444      bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
1445      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1446      bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
1447      bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1448      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1449      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1450      bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1451      bash-2042    3d..1  120us : <stack trace>
1452   => _raw_spin_unlock_irqrestore
1453   => ata_scsi_queuecmd
1454   => scsi_dispatch_cmd
1455   => scsi_request_fn
1456   => __blk_run_queue_uncond
1457   => __blk_run_queue
1458   => blk_queue_bio
1459   => submit_bio_noacct
1460   => submit_bio
1461   => submit_bh
1462   => __ext3_get_inode_loc
1463   => ext3_iget
1464   => ext3_lookup
1465   => lookup_real
1466   => __lookup_hash
1467   => walk_component
1468   => lookup_last
1469   => path_lookupat
1470   => filename_lookup
1471   => user_path_at_empty
1472   => user_path_at
1473   => vfs_fstatat
1474   => vfs_stat
1475   => sys_newstat
1476   => system_call_fastpath
1477
1478
1479Here we traced a 71 microsecond latency. But we also see all the
1480functions that were called during that time. Note that by
1481enabling function tracing, we incur an added overhead. This
1482overhead may extend the latency times. But nevertheless, this
1483trace has provided some very helpful debugging information.
1484
1485If we prefer function graph output instead of function, we can set
1486display-graph option::
1487
1488 with echo 1 > options/display-graph
1489
1490  # tracer: irqsoff
1491  #
1492  # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
1493  # --------------------------------------------------------------------
1494  # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
1495  #    -----------------
1496  #    | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
1497  #    -----------------
1498  #  => started at: free_debug_processing
1499  #  => ended at:   return_to_handler
1500  #
1501  #
1502  #                                       _-----=> irqs-off
1503  #                                      / _----=> need-resched
1504  #                                     | / _---=> hardirq/softirq
1505  #                                     || / _--=> preempt-depth
1506  #                                     ||| /
1507  #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
1508  #      |          |     |    |        ||||      |   |                     |   |   |   |
1509          0 us |   0)   bash-1507    |  d... |   0.000 us    |  _raw_spin_lock_irqsave();
1510          0 us |   0)   bash-1507    |  d..1 |   0.378 us    |    do_raw_spin_trylock();
1511          1 us |   0)   bash-1507    |  d..2 |               |    set_track() {
1512          2 us |   0)   bash-1507    |  d..2 |               |      save_stack_trace() {
1513          2 us |   0)   bash-1507    |  d..2 |               |        __save_stack_trace() {
1514          3 us |   0)   bash-1507    |  d..2 |               |          __unwind_start() {
1515          3 us |   0)   bash-1507    |  d..2 |               |            get_stack_info() {
1516          3 us |   0)   bash-1507    |  d..2 |   0.351 us    |              in_task_stack();
1517          4 us |   0)   bash-1507    |  d..2 |   1.107 us    |            }
1518  [...]
1519       3750 us |   0)   bash-1507    |  d..1 |   0.516 us    |      do_raw_spin_unlock();
1520       3750 us |   0)   bash-1507    |  d..1 |   0.000 us    |  _raw_spin_unlock_irqrestore();
1521       3764 us |   0)   bash-1507    |  d..1 |   0.000 us    |  tracer_hardirqs_on();
1522      bash-1507    0d..1 3792us : <stack trace>
1523   => free_debug_processing
1524   => __slab_free
1525   => kmem_cache_free
1526   => vm_area_free
1527   => remove_vma
1528   => exit_mmap
1529   => mmput
1530   => begin_new_exec
1531   => load_elf_binary
1532   => search_binary_handler
1533   => __do_execve_file.isra.32
1534   => __x64_sys_execve
1535   => do_syscall_64
1536   => entry_SYSCALL_64_after_hwframe
1537
1538preemptoff
1539----------
1540
1541When preemption is disabled, we may be able to receive
1542interrupts but the task cannot be preempted and a higher
1543priority task must wait for preemption to be enabled again
1544before it can preempt a lower priority task.
1545
1546The preemptoff tracer traces the places that disable preemption.
1547Like the irqsoff tracer, it records the maximum latency for
1548which preemption was disabled. The control of preemptoff tracer
1549is much like the irqsoff tracer.
1550::
1551
1552  # echo 0 > options/function-trace
1553  # echo preemptoff > current_tracer
1554  # echo 1 > tracing_on
1555  # echo 0 > tracing_max_latency
1556  # ls -ltr
1557  [...]
1558  # echo 0 > tracing_on
1559  # cat trace
1560  # tracer: preemptoff
1561  #
1562  # preemptoff latency trace v1.1.5 on 3.8.0-test+
1563  # --------------------------------------------------------------------
1564  # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1565  #    -----------------
1566  #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1567  #    -----------------
1568  #  => started at: do_IRQ
1569  #  => ended at:   do_IRQ
1570  #
1571  #
1572  #                  _------=> CPU#
1573  #                 / _-----=> irqs-off
1574  #                | / _----=> need-resched
1575  #                || / _---=> hardirq/softirq
1576  #                ||| / _--=> preempt-depth
1577  #                |||| /     delay
1578  #  cmd     pid   ||||| time  |   caller
1579  #     \   /      |||||  \    |   /
1580      sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
1581      sshd-1991    1d..1   46us : irq_exit <-do_IRQ
1582      sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
1583      sshd-1991    1d..1   52us : <stack trace>
1584   => sub_preempt_count
1585   => irq_exit
1586   => do_IRQ
1587   => ret_from_intr
1588
1589
1590This has some more changes. Preemption was disabled when an
1591interrupt came in (notice the 'h'), and was enabled on exit.
1592But we also see that interrupts have been disabled when entering
1593the preempt off section and leaving it (the 'd'). We do not know if
1594interrupts were enabled in the mean time or shortly after this
1595was over.
1596::
1597
1598  # tracer: preemptoff
1599  #
1600  # preemptoff latency trace v1.1.5 on 3.8.0-test+
1601  # --------------------------------------------------------------------
1602  # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1603  #    -----------------
1604  #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1605  #    -----------------
1606  #  => started at: wake_up_new_task
1607  #  => ended at:   task_rq_unlock
1608  #
1609  #
1610  #                  _------=> CPU#
1611  #                 / _-----=> irqs-off
1612  #                | / _----=> need-resched
1613  #                || / _---=> hardirq/softirq
1614  #                ||| / _--=> preempt-depth
1615  #                |||| /     delay
1616  #  cmd     pid   ||||| time  |   caller
1617  #     \   /      |||||  \    |   /
1618      bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
1619      bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
1620      bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
1621      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1622      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1623  [...]
1624      bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
1625      bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
1626      bash-1994    1d..1   13us : add_preempt_count <-irq_enter
1627      bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
1628      bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1629      bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
1630      bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
1631      bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
1632  [...]
1633      bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
1634      bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
1635      bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
1636      bash-1994    1d..2   36us : do_softirq <-irq_exit
1637      bash-1994    1d..2   36us : __do_softirq <-call_softirq
1638      bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
1639      bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
1640      bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
1641      bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
1642      bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
1643  [...]
1644      bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
1645      bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
1646      bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
1647      bash-1994    1dN.2   82us : idle_cpu <-irq_exit
1648      bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
1649      bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
1650      bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1651      bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
1652      bash-1994    1.N.1  104us : <stack trace>
1653   => sub_preempt_count
1654   => _raw_spin_unlock_irqrestore
1655   => task_rq_unlock
1656   => wake_up_new_task
1657   => do_fork
1658   => sys_clone
1659   => stub_clone
1660
1661
1662The above is an example of the preemptoff trace with
1663function-trace set. Here we see that interrupts were not disabled
1664the entire time. The irq_enter code lets us know that we entered
1665an interrupt 'h'. Before that, the functions being traced still
1666show that it is not in an interrupt, but we can see from the
1667functions themselves that this is not the case.
1668
1669preemptirqsoff
1670--------------
1671
1672Knowing the locations that have interrupts disabled or
1673preemption disabled for the longest times is helpful. But
1674sometimes we would like to know when either preemption and/or
1675interrupts are disabled.
1676
1677Consider the following code::
1678
1679    local_irq_disable();
1680    call_function_with_irqs_off();
1681    preempt_disable();
1682    call_function_with_irqs_and_preemption_off();
1683    local_irq_enable();
1684    call_function_with_preemption_off();
1685    preempt_enable();
1686
1687The irqsoff tracer will record the total length of
1688call_function_with_irqs_off() and
1689call_function_with_irqs_and_preemption_off().
1690
1691The preemptoff tracer will record the total length of
1692call_function_with_irqs_and_preemption_off() and
1693call_function_with_preemption_off().
1694
1695But neither will trace the time that interrupts and/or
1696preemption is disabled. This total time is the time that we can
1697not schedule. To record this time, use the preemptirqsoff
1698tracer.
1699
1700Again, using this trace is much like the irqsoff and preemptoff
1701tracers.
1702::
1703
1704  # echo 0 > options/function-trace
1705  # echo preemptirqsoff > current_tracer
1706  # echo 1 > tracing_on
1707  # echo 0 > tracing_max_latency
1708  # ls -ltr
1709  [...]
1710  # echo 0 > tracing_on
1711  # cat trace
1712  # tracer: preemptirqsoff
1713  #
1714  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1715  # --------------------------------------------------------------------
1716  # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1717  #    -----------------
1718  #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1719  #    -----------------
1720  #  => started at: ata_scsi_queuecmd
1721  #  => ended at:   ata_scsi_queuecmd
1722  #
1723  #
1724  #                  _------=> CPU#
1725  #                 / _-----=> irqs-off
1726  #                | / _----=> need-resched
1727  #                || / _---=> hardirq/softirq
1728  #                ||| / _--=> preempt-depth
1729  #                |||| /     delay
1730  #  cmd     pid   ||||| time  |   caller
1731  #     \   /      |||||  \    |   /
1732        ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1733        ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1734        ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
1735        ls-2230    3...1  111us : <stack trace>
1736   => sub_preempt_count
1737   => _raw_spin_unlock_irqrestore
1738   => ata_scsi_queuecmd
1739   => scsi_dispatch_cmd
1740   => scsi_request_fn
1741   => __blk_run_queue_uncond
1742   => __blk_run_queue
1743   => blk_queue_bio
1744   => submit_bio_noacct
1745   => submit_bio
1746   => submit_bh
1747   => ext3_bread
1748   => ext3_dir_bread
1749   => htree_dirblock_to_tree
1750   => ext3_htree_fill_tree
1751   => ext3_readdir
1752   => vfs_readdir
1753   => sys_getdents
1754   => system_call_fastpath
1755
1756
1757The trace_hardirqs_off_thunk is called from assembly on x86 when
1758interrupts are disabled in the assembly code. Without the
1759function tracing, we do not know if interrupts were enabled
1760within the preemption points. We do see that it started with
1761preemption enabled.
1762
1763Here is a trace with function-trace set::
1764
1765  # tracer: preemptirqsoff
1766  #
1767  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1768  # --------------------------------------------------------------------
1769  # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1770  #    -----------------
1771  #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1772  #    -----------------
1773  #  => started at: schedule
1774  #  => ended at:   mutex_unlock
1775  #
1776  #
1777  #                  _------=> CPU#
1778  #                 / _-----=> irqs-off
1779  #                | / _----=> need-resched
1780  #                || / _---=> hardirq/softirq
1781  #                ||| / _--=> preempt-depth
1782  #                |||| /     delay
1783  #  cmd     pid   ||||| time  |   caller
1784  #     \   /      |||||  \    |   /
1785  kworker/-59      3...1    0us : __schedule <-schedule
1786  kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
1787  kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
1788  kworker/-59      3d..2    1us : deactivate_task <-__schedule
1789  kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
1790  kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
1791  kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
1792  kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
1793  kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
1794  kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
1795  kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
1796  kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
1797  kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1798  kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
1799  kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
1800  kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
1801  kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
1802  kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
1803  kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
1804  kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
1805  kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
1806  kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
1807  kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
1808  kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
1809  kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
1810        ls-2269    3d..2    7us : finish_task_switch <-__schedule
1811        ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
1812        ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
1813        ls-2269    3d..2    8us : irq_enter <-do_IRQ
1814        ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
1815        ls-2269    3d..2    9us : add_preempt_count <-irq_enter
1816        ls-2269    3d.h2    9us : exit_idle <-do_IRQ
1817  [...]
1818        ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
1819        ls-2269    3d.h2   20us : irq_exit <-do_IRQ
1820        ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
1821        ls-2269    3d..3   21us : do_softirq <-irq_exit
1822        ls-2269    3d..3   21us : __do_softirq <-call_softirq
1823        ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
1824        ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
1825        ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
1826        ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
1827        ls-2269    3d.s5   31us : irq_enter <-do_IRQ
1828        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1829  [...]
1830        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1831        ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
1832        ls-2269    3d.H5   32us : exit_idle <-do_IRQ
1833        ls-2269    3d.H5   32us : handle_irq <-do_IRQ
1834        ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
1835        ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
1836  [...]
1837        ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1838        ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1839        ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
1840        ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
1841        ls-2269    3d..3  159us : idle_cpu <-irq_exit
1842        ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
1843        ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
1844        ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
1845        ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
1846        ls-2269    3d...  186us : <stack trace>
1847   => __mutex_unlock_slowpath
1848   => mutex_unlock
1849   => process_output
1850   => n_tty_write
1851   => tty_write
1852   => vfs_write
1853   => sys_write
1854   => system_call_fastpath
1855
1856This is an interesting trace. It started with kworker running and
1857scheduling out and ls taking over. But as soon as ls released the
1858rq lock and enabled interrupts (but not preemption) an interrupt
1859triggered. When the interrupt finished, it started running softirqs.
1860But while the softirq was running, another interrupt triggered.
1861When an interrupt is running inside a softirq, the annotation is 'H'.
1862
1863
1864wakeup
1865------
1866
1867One common case that people are interested in tracing is the
1868time it takes for a task that is woken to actually wake up.
1869Now for non Real-Time tasks, this can be arbitrary. But tracing
1870it none the less can be interesting.
1871
1872Without function tracing::
1873
1874  # echo 0 > options/function-trace
1875  # echo wakeup > current_tracer
1876  # echo 1 > tracing_on
1877  # echo 0 > tracing_max_latency
1878  # chrt -f 5 sleep 1
1879  # echo 0 > tracing_on
1880  # cat trace
1881  # tracer: wakeup
1882  #
1883  # wakeup latency trace v1.1.5 on 3.8.0-test+
1884  # --------------------------------------------------------------------
1885  # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1886  #    -----------------
1887  #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1888  #    -----------------
1889  #
1890  #                  _------=> CPU#
1891  #                 / _-----=> irqs-off
1892  #                | / _----=> need-resched
1893  #                || / _---=> hardirq/softirq
1894  #                ||| / _--=> preempt-depth
1895  #                |||| /     delay
1896  #  cmd     pid   ||||| time  |   caller
1897  #     \   /      |||||  \    |   /
1898    <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
1899    <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1900    <idle>-0       3d..3   15us : __schedule <-schedule
1901    <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
1902
1903The tracer only traces the highest priority task in the system
1904to avoid tracing the normal circumstances. Here we see that
1905the kworker with a nice priority of -20 (not very nice), took
1906just 15 microseconds from the time it woke up, to the time it
1907ran.
1908
1909Non Real-Time tasks are not that interesting. A more interesting
1910trace is to concentrate only on Real-Time tasks.
1911
1912wakeup_rt
1913---------
1914
1915In a Real-Time environment it is very important to know the
1916wakeup time it takes for the highest priority task that is woken
1917up to the time that it executes. This is also known as "schedule
1918latency". I stress the point that this is about RT tasks. It is
1919also important to know the scheduling latency of non-RT tasks,
1920but the average schedule latency is better for non-RT tasks.
1921Tools like LatencyTop are more appropriate for such
1922measurements.
1923
1924Real-Time environments are interested in the worst case latency.
1925That is the longest latency it takes for something to happen,
1926and not the average. We can have a very fast scheduler that may
1927only have a large latency once in a while, but that would not
1928work well with Real-Time tasks.  The wakeup_rt tracer was designed
1929to record the worst case wakeups of RT tasks. Non-RT tasks are
1930not recorded because the tracer only records one worst case and
1931tracing non-RT tasks that are unpredictable will overwrite the
1932worst case latency of RT tasks (just run the normal wakeup
1933tracer for a while to see that effect).
1934
1935Since this tracer only deals with RT tasks, we will run this
1936slightly differently than we did with the previous tracers.
1937Instead of performing an 'ls', we will run 'sleep 1' under
1938'chrt' which changes the priority of the task.
1939::
1940
1941  # echo 0 > options/function-trace
1942  # echo wakeup_rt > current_tracer
1943  # echo 1 > tracing_on
1944  # echo 0 > tracing_max_latency
1945  # chrt -f 5 sleep 1
1946  # echo 0 > tracing_on
1947  # cat trace
1948  # tracer: wakeup
1949  #
1950  # tracer: wakeup_rt
1951  #
1952  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1953  # --------------------------------------------------------------------
1954  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1955  #    -----------------
1956  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1957  #    -----------------
1958  #
1959  #                  _------=> CPU#
1960  #                 / _-----=> irqs-off
1961  #                | / _----=> need-resched
1962  #                || / _---=> hardirq/softirq
1963  #                ||| / _--=> preempt-depth
1964  #                |||| /     delay
1965  #  cmd     pid   ||||| time  |   caller
1966  #     \   /      |||||  \    |   /
1967    <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
1968    <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1969    <idle>-0       3d..3    5us : __schedule <-schedule
1970    <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1971
1972
1973Running this on an idle system, we see that it only took 5 microseconds
1974to perform the task switch.  Note, since the trace point in the schedule
1975is before the actual "switch", we stop the tracing when the recorded task
1976is about to schedule in. This may change if we add a new marker at the
1977end of the scheduler.
1978
1979Notice that the recorded task is 'sleep' with the PID of 2389
1980and it has an rt_prio of 5. This priority is user-space priority
1981and not the internal kernel priority. The policy is 1 for
1982SCHED_FIFO and 2 for SCHED_RR.
1983
1984Note, that the trace data shows the internal priority (99 - rtprio).
1985::
1986
1987  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1988
1989The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
1990and in the running state 'R'. The sleep task was scheduled in with
19912389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1992and it too is in the running state.
1993
1994Doing the same with chrt -r 5 and function-trace set.
1995::
1996
1997  echo 1 > options/function-trace
1998
1999  # tracer: wakeup_rt
2000  #
2001  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2002  # --------------------------------------------------------------------
2003  # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2004  #    -----------------
2005  #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
2006  #    -----------------
2007  #
2008  #                  _------=> CPU#
2009  #                 / _-----=> irqs-off
2010  #                | / _----=> need-resched
2011  #                || / _---=> hardirq/softirq
2012  #                ||| / _--=> preempt-depth
2013  #                |||| /     delay
2014  #  cmd     pid   ||||| time  |   caller
2015  #     \   /      |||||  \    |   /
2016    <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
2017    <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2018    <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
2019    <idle>-0       3d.h3    3us : resched_curr <-check_preempt_curr
2020    <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
2021    <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
2022    <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
2023    <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
2024    <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
2025    <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2026    <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
2027    <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
2028    <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
2029    <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
2030    <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
2031    <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
2032    <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
2033    <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
2034    <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
2035    <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
2036    <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
2037    <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
2038    <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
2039    <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
2040    <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
2041    <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
2042    <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
2043    <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
2044    <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
2045    <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
2046    <idle>-0       3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
2047    <idle>-0       3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
2048    <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
2049    <idle>-0       3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
2050    <idle>-0       3dN.2   14us : sched_avg_update <-__cpu_load_update
2051    <idle>-0       3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
2052    <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
2053    <idle>-0       3dN.1   15us : calc_load_nohz_stop <-tick_nohz_idle_exit
2054    <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
2055    <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
2056    <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
2057    <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
2058    <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2059    <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
2060    <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
2061    <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
2062    <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
2063    <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
2064    <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
2065    <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
2066    <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
2067    <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2068    <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
2069    <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
2070    <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
2071    <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2072    <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
2073    <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
2074    <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2075    <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
2076    <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
2077    <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
2078    <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
2079    <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
2080    <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
2081    <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
2082    <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
2083    <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2084    <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
2085    <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
2086    <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
2087    <idle>-0       3.N..   25us : schedule <-cpu_idle
2088    <idle>-0       3.N..   25us : __schedule <-preempt_schedule
2089    <idle>-0       3.N..   26us : add_preempt_count <-__schedule
2090    <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
2091    <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
2092    <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
2093    <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
2094    <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
2095    <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
2096    <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
2097    <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
2098    <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
2099    <idle>-0       3d..3   29us : __schedule <-preempt_schedule
2100    <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
2101
2102This isn't that big of a trace, even with function tracing enabled,
2103so I included the entire trace.
2104
2105The interrupt went off while when the system was idle. Somewhere
2106before task_woken_rt() was called, the NEED_RESCHED flag was set,
2107this is indicated by the first occurrence of the 'N' flag.
2108
2109Latency tracing and events
2110--------------------------
2111As function tracing can induce a much larger latency, but without
2112seeing what happens within the latency it is hard to know what
2113caused it. There is a middle ground, and that is with enabling
2114events.
2115::
2116
2117  # echo 0 > options/function-trace
2118  # echo wakeup_rt > current_tracer
2119  # echo 1 > events/enable
2120  # echo 1 > tracing_on
2121  # echo 0 > tracing_max_latency
2122  # chrt -f 5 sleep 1
2123  # echo 0 > tracing_on
2124  # cat trace
2125  # tracer: wakeup_rt
2126  #
2127  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2128  # --------------------------------------------------------------------
2129  # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2130  #    -----------------
2131  #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
2132  #    -----------------
2133  #
2134  #                  _------=> CPU#
2135  #                 / _-----=> irqs-off
2136  #                | / _----=> need-resched
2137  #                || / _---=> hardirq/softirq
2138  #                ||| / _--=> preempt-depth
2139  #                |||| /     delay
2140  #  cmd     pid   ||||| time  |   caller
2141  #     \   /      |||||  \    |   /
2142    <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
2143    <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2144    <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
2145    <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
2146    <idle>-0       2.N.2    2us : power_end: cpu_id=2
2147    <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
2148    <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
2149    <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
2150    <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
2151    <idle>-0       2.N.2    5us : rcu_utilization: End context switch
2152    <idle>-0       2d..3    6us : __schedule <-schedule
2153    <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
2154
2155
2156Hardware Latency Detector
2157-------------------------
2158
2159The hardware latency detector is executed by enabling the "hwlat" tracer.
2160
2161NOTE, this tracer will affect the performance of the system as it will
2162periodically make a CPU constantly busy with interrupts disabled.
2163::
2164
2165  # echo hwlat > current_tracer
2166  # sleep 100
2167  # cat trace
2168  # tracer: hwlat
2169  #
2170  # entries-in-buffer/entries-written: 13/13   #P:8
2171  #
2172  #                              _-----=> irqs-off
2173  #                             / _----=> need-resched
2174  #                            | / _---=> hardirq/softirq
2175  #                            || / _--=> preempt-depth
2176  #                            ||| /     delay
2177  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2178  #              | |       |   ||||       |         |
2179             <...>-1729  [001] d...   678.473449: #1     inner/outer(us):   11/12    ts:1581527483.343962693 count:6
2180             <...>-1729  [004] d...   689.556542: #2     inner/outer(us):   16/9     ts:1581527494.889008092 count:1
2181             <...>-1729  [005] d...   714.756290: #3     inner/outer(us):   16/16    ts:1581527519.678961629 count:5
2182             <...>-1729  [001] d...   718.788247: #4     inner/outer(us):    9/17    ts:1581527523.889012713 count:1
2183             <...>-1729  [002] d...   719.796341: #5     inner/outer(us):   13/9     ts:1581527524.912872606 count:1
2184             <...>-1729  [006] d...   844.787091: #6     inner/outer(us):    9/12    ts:1581527649.889048502 count:2
2185             <...>-1729  [003] d...   849.827033: #7     inner/outer(us):   18/9     ts:1581527654.889013793 count:1
2186             <...>-1729  [007] d...   853.859002: #8     inner/outer(us):    9/12    ts:1581527658.889065736 count:1
2187             <...>-1729  [001] d...   855.874978: #9     inner/outer(us):    9/11    ts:1581527660.861991877 count:1
2188             <...>-1729  [001] d...   863.938932: #10    inner/outer(us):    9/11    ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
2189             <...>-1729  [007] d...   878.050780: #11    inner/outer(us):    9/12    ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
2190             <...>-1729  [007] d...   886.114702: #12    inner/outer(us):    9/12    ts:1581527691.385001600 count:1
2191
2192
2193The above output is somewhat the same in the header. All events will have
2194interrupts disabled 'd'. Under the FUNCTION title there is:
2195
2196 #1
2197	This is the count of events recorded that were greater than the
2198	tracing_threshold (See below).
2199
2200 inner/outer(us):   11/11
2201
2202      This shows two numbers as "inner latency" and "outer latency". The test
2203      runs in a loop checking a timestamp twice. The latency detected within
2204      the two timestamps is the "inner latency" and the latency detected
2205      after the previous timestamp and the next timestamp in the loop is
2206      the "outer latency".
2207
2208 ts:1581527483.343962693
2209
2210      The absolute timestamp that the first latency was recorded in the window.
2211
2212 count:6
2213
2214      The number of times a latency was detected during the window.
2215
2216 nmi-total:7 nmi-count:1
2217
2218      On architectures that support it, if an NMI comes in during the
2219      test, the time spent in NMI is reported in "nmi-total" (in
2220      microseconds).
2221
2222      All architectures that have NMIs will show the "nmi-count" if an
2223      NMI comes in during the test.
2224
2225hwlat files:
2226
2227  tracing_threshold
2228	This gets automatically set to "10" to represent 10
2229	microseconds. This is the threshold of latency that
2230	needs to be detected before the trace will be recorded.
2231
2232	Note, when hwlat tracer is finished (another tracer is
2233	written into "current_tracer"), the original value for
2234	tracing_threshold is placed back into this file.
2235
2236  hwlat_detector/width
2237	The length of time the test runs with interrupts disabled.
2238
2239  hwlat_detector/window
2240	The length of time of the window which the test
2241	runs. That is, the test will run for "width"
2242	microseconds per "window" microseconds
2243
2244  tracing_cpumask
2245	When the test is started. A kernel thread is created that
2246	runs the test. This thread will alternate between CPUs
2247	listed in the tracing_cpumask between each period
2248	(one "window"). To limit the test to specific CPUs
2249	set the mask in this file to only the CPUs that the test
2250	should run on.
2251
2252function
2253--------
2254
2255This tracer is the function tracer. Enabling the function tracer
2256can be done from the debug file system. Make sure the
2257ftrace_enabled is set; otherwise this tracer is a nop.
2258See the "ftrace_enabled" section below.
2259::
2260
2261  # sysctl kernel.ftrace_enabled=1
2262  # echo function > current_tracer
2263  # echo 1 > tracing_on
2264  # usleep 1
2265  # echo 0 > tracing_on
2266  # cat trace
2267  # tracer: function
2268  #
2269  # entries-in-buffer/entries-written: 24799/24799   #P:4
2270  #
2271  #                              _-----=> irqs-off
2272  #                             / _----=> need-resched
2273  #                            | / _---=> hardirq/softirq
2274  #                            || / _--=> preempt-depth
2275  #                            ||| /     delay
2276  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2277  #              | |       |   ||||       |         |
2278              bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
2279              bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2280              bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
2281              bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
2282              bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
2283              bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
2284              bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
2285              bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
2286  [...]
2287
2288
2289Note: function tracer uses ring buffers to store the above
2290entries. The newest data may overwrite the oldest data.
2291Sometimes using echo to stop the trace is not sufficient because
2292the tracing could have overwritten the data that you wanted to
2293record. For this reason, it is sometimes better to disable
2294tracing directly from a program. This allows you to stop the
2295tracing at the point that you hit the part that you are
2296interested in. To disable the tracing directly from a C program,
2297something like following code snippet can be used::
2298
2299	int trace_fd;
2300	[...]
2301	int main(int argc, char *argv[]) {
2302		[...]
2303		trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2304		[...]
2305		if (condition_hit()) {
2306			write(trace_fd, "0", 1);
2307		}
2308		[...]
2309	}
2310
2311
2312Single thread tracing
2313---------------------
2314
2315By writing into set_ftrace_pid you can trace a
2316single thread. For example::
2317
2318  # cat set_ftrace_pid
2319  no pid
2320  # echo 3111 > set_ftrace_pid
2321  # cat set_ftrace_pid
2322  3111
2323  # echo function > current_tracer
2324  # cat trace | head
2325  # tracer: function
2326  #
2327  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2328  #              | |       |          |         |
2329      yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
2330      yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2331      yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2332      yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2333      yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
2334      yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
2335  # echo > set_ftrace_pid
2336  # cat trace |head
2337  # tracer: function
2338  #
2339  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2340  #              | |       |          |         |
2341  ##### CPU 3 buffer started ####
2342      yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
2343      yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
2344      yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
2345      yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
2346      yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
2347
2348If you want to trace a function when executing, you could use
2349something like this simple program.
2350::
2351
2352	#include <stdio.h>
2353	#include <stdlib.h>
2354	#include <sys/types.h>
2355	#include <sys/stat.h>
2356	#include <fcntl.h>
2357	#include <unistd.h>
2358	#include <string.h>
2359
2360	#define _STR(x) #x
2361	#define STR(x) _STR(x)
2362	#define MAX_PATH 256
2363
2364	const char *find_tracefs(void)
2365	{
2366	       static char tracefs[MAX_PATH+1];
2367	       static int tracefs_found;
2368	       char type[100];
2369	       FILE *fp;
2370
2371	       if (tracefs_found)
2372		       return tracefs;
2373
2374	       if ((fp = fopen("/proc/mounts","r")) == NULL) {
2375		       perror("/proc/mounts");
2376		       return NULL;
2377	       }
2378
2379	       while (fscanf(fp, "%*s %"
2380		             STR(MAX_PATH)
2381		             "s %99s %*s %*d %*d\n",
2382		             tracefs, type) == 2) {
2383		       if (strcmp(type, "tracefs") == 0)
2384		               break;
2385	       }
2386	       fclose(fp);
2387
2388	       if (strcmp(type, "tracefs") != 0) {
2389		       fprintf(stderr, "tracefs not mounted");
2390		       return NULL;
2391	       }
2392
2393	       strcat(tracefs, "/tracing/");
2394	       tracefs_found = 1;
2395
2396	       return tracefs;
2397	}
2398
2399	const char *tracing_file(const char *file_name)
2400	{
2401	       static char trace_file[MAX_PATH+1];
2402	       snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2403	       return trace_file;
2404	}
2405
2406	int main (int argc, char **argv)
2407	{
2408		if (argc < 1)
2409		        exit(-1);
2410
2411		if (fork() > 0) {
2412		        int fd, ffd;
2413		        char line[64];
2414		        int s;
2415
2416		        ffd = open(tracing_file("current_tracer"), O_WRONLY);
2417		        if (ffd < 0)
2418		                exit(-1);
2419		        write(ffd, "nop", 3);
2420
2421		        fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2422		        s = sprintf(line, "%d\n", getpid());
2423		        write(fd, line, s);
2424
2425		        write(ffd, "function", 8);
2426
2427		        close(fd);
2428		        close(ffd);
2429
2430		        execvp(argv[1], argv+1);
2431		}
2432
2433		return 0;
2434	}
2435
2436Or this simple script!
2437::
2438
2439  #!/bin/bash
2440
2441  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
2442  echo nop > $tracefs/tracing/current_tracer
2443  echo 0 > $tracefs/tracing/tracing_on
2444  echo $$ > $tracefs/tracing/set_ftrace_pid
2445  echo function > $tracefs/tracing/current_tracer
2446  echo 1 > $tracefs/tracing/tracing_on
2447  exec "$@"
2448
2449
2450function graph tracer
2451---------------------------
2452
2453This tracer is similar to the function tracer except that it
2454probes a function on its entry and its exit. This is done by
2455using a dynamically allocated stack of return addresses in each
2456task_struct. On function entry the tracer overwrites the return
2457address of each function traced to set a custom probe. Thus the
2458original return address is stored on the stack of return address
2459in the task_struct.
2460
2461Probing on both ends of a function leads to special features
2462such as:
2463
2464- measure of a function's time execution
2465- having a reliable call stack to draw function calls graph
2466
2467This tracer is useful in several situations:
2468
2469- you want to find the reason of a strange kernel behavior and
2470  need to see what happens in detail on any areas (or specific
2471  ones).
2472
2473- you are experiencing weird latencies but it's difficult to
2474  find its origin.
2475
2476- you want to find quickly which path is taken by a specific
2477  function
2478
2479- you just want to peek inside a working kernel and want to see
2480  what happens there.
2481
2482::
2483
2484  # tracer: function_graph
2485  #
2486  # CPU  DURATION                  FUNCTION CALLS
2487  # |     |   |                     |   |   |   |
2488
2489   0)               |  sys_open() {
2490   0)               |    do_sys_open() {
2491   0)               |      getname() {
2492   0)               |        kmem_cache_alloc() {
2493   0)   1.382 us    |          __might_sleep();
2494   0)   2.478 us    |        }
2495   0)               |        strncpy_from_user() {
2496   0)               |          might_fault() {
2497   0)   1.389 us    |            __might_sleep();
2498   0)   2.553 us    |          }
2499   0)   3.807 us    |        }
2500   0)   7.876 us    |      }
2501   0)               |      alloc_fd() {
2502   0)   0.668 us    |        _spin_lock();
2503   0)   0.570 us    |        expand_files();
2504   0)   0.586 us    |        _spin_unlock();
2505
2506
2507There are several columns that can be dynamically
2508enabled/disabled. You can use every combination of options you
2509want, depending on your needs.
2510
2511- The cpu number on which the function executed is default
2512  enabled.  It is sometimes better to only trace one cpu (see
2513  tracing_cpu_mask file) or you might sometimes see unordered
2514  function calls while cpu tracing switch.
2515
2516	- hide: echo nofuncgraph-cpu > trace_options
2517	- show: echo funcgraph-cpu > trace_options
2518
2519- The duration (function's time of execution) is displayed on
2520  the closing bracket line of a function or on the same line
2521  than the current function in case of a leaf one. It is default
2522  enabled.
2523
2524	- hide: echo nofuncgraph-duration > trace_options
2525	- show: echo funcgraph-duration > trace_options
2526
2527- The overhead field precedes the duration field in case of
2528  reached duration thresholds.
2529
2530	- hide: echo nofuncgraph-overhead > trace_options
2531	- show: echo funcgraph-overhead > trace_options
2532	- depends on: funcgraph-duration
2533
2534  ie::
2535
2536    3) # 1837.709 us |          } /* __switch_to */
2537    3)               |          finish_task_switch() {
2538    3)   0.313 us    |            _raw_spin_unlock_irq();
2539    3)   3.177 us    |          }
2540    3) # 1889.063 us |        } /* __schedule */
2541    3) ! 140.417 us  |      } /* __schedule */
2542    3) # 2034.948 us |    } /* schedule */
2543    3) * 33998.59 us |  } /* schedule_preempt_disabled */
2544
2545    [...]
2546
2547    1)   0.260 us    |              msecs_to_jiffies();
2548    1)   0.313 us    |              __rcu_read_unlock();
2549    1) + 61.770 us   |            }
2550    1) + 64.479 us   |          }
2551    1)   0.313 us    |          rcu_bh_qs();
2552    1)   0.313 us    |          __local_bh_enable();
2553    1) ! 217.240 us  |        }
2554    1)   0.365 us    |        idle_cpu();
2555    1)               |        rcu_irq_exit() {
2556    1)   0.417 us    |          rcu_eqs_enter_common.isra.47();
2557    1)   3.125 us    |        }
2558    1) ! 227.812 us  |      }
2559    1) ! 457.395 us  |    }
2560    1) @ 119760.2 us |  }
2561
2562    [...]
2563
2564    2)               |    handle_IPI() {
2565    1)   6.979 us    |                  }
2566    2)   0.417 us    |      scheduler_ipi();
2567    1)   9.791 us    |                }
2568    1) + 12.917 us   |              }
2569    2)   3.490 us    |    }
2570    1) + 15.729 us   |            }
2571    1) + 18.542 us   |          }
2572    2) $ 3594274 us  |  }
2573
2574Flags::
2575
2576  + means that the function exceeded 10 usecs.
2577  ! means that the function exceeded 100 usecs.
2578  # means that the function exceeded 1000 usecs.
2579  * means that the function exceeded 10 msecs.
2580  @ means that the function exceeded 100 msecs.
2581  $ means that the function exceeded 1 sec.
2582
2583
2584- The task/pid field displays the thread cmdline and pid which
2585  executed the function. It is default disabled.
2586
2587	- hide: echo nofuncgraph-proc > trace_options
2588	- show: echo funcgraph-proc > trace_options
2589
2590  ie::
2591
2592    # tracer: function_graph
2593    #
2594    # CPU  TASK/PID        DURATION                  FUNCTION CALLS
2595    # |    |    |           |   |                     |   |   |   |
2596    0)    sh-4802     |               |                  d_free() {
2597    0)    sh-4802     |               |                    call_rcu() {
2598    0)    sh-4802     |               |                      __call_rcu() {
2599    0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
2600    0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
2601    0)    sh-4802     |   2.899 us    |                      }
2602    0)    sh-4802     |   4.040 us    |                    }
2603    0)    sh-4802     |   5.151 us    |                  }
2604    0)    sh-4802     | + 49.370 us   |                }
2605
2606
2607- The absolute time field is an absolute timestamp given by the
2608  system clock since it started. A snapshot of this time is
2609  given on each entry/exit of functions
2610
2611	- hide: echo nofuncgraph-abstime > trace_options
2612	- show: echo funcgraph-abstime > trace_options
2613
2614  ie::
2615
2616    #
2617    #      TIME       CPU  DURATION                  FUNCTION CALLS
2618    #       |         |     |   |                     |   |   |   |
2619    360.774522 |   1)   0.541 us    |                                          }
2620    360.774522 |   1)   4.663 us    |                                        }
2621    360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
2622    360.774524 |   1)   6.796 us    |                                      }
2623    360.774524 |   1)   7.952 us    |                                    }
2624    360.774525 |   1)   9.063 us    |                                  }
2625    360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
2626    360.774527 |   1)   0.578 us    |                                  __brelse();
2627    360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
2628    360.774528 |   1)               |                                    unlock_buffer() {
2629    360.774529 |   1)               |                                      wake_up_bit() {
2630    360.774529 |   1)               |                                        bit_waitqueue() {
2631    360.774530 |   1)   0.594 us    |                                          __phys_addr();
2632
2633
2634The function name is always displayed after the closing bracket
2635for a function if the start of that function is not in the
2636trace buffer.
2637
2638Display of the function name after the closing bracket may be
2639enabled for functions whose start is in the trace buffer,
2640allowing easier searching with grep for function durations.
2641It is default disabled.
2642
2643	- hide: echo nofuncgraph-tail > trace_options
2644	- show: echo funcgraph-tail > trace_options
2645
2646  Example with nofuncgraph-tail (default)::
2647
2648    0)               |      putname() {
2649    0)               |        kmem_cache_free() {
2650    0)   0.518 us    |          __phys_addr();
2651    0)   1.757 us    |        }
2652    0)   2.861 us    |      }
2653
2654  Example with funcgraph-tail::
2655
2656    0)               |      putname() {
2657    0)               |        kmem_cache_free() {
2658    0)   0.518 us    |          __phys_addr();
2659    0)   1.757 us    |        } /* kmem_cache_free() */
2660    0)   2.861 us    |      } /* putname() */
2661
2662You can put some comments on specific functions by using
2663trace_printk() For example, if you want to put a comment inside
2664the __might_sleep() function, you just have to include
2665<linux/ftrace.h> and call trace_printk() inside __might_sleep()::
2666
2667	trace_printk("I'm a comment!\n")
2668
2669will produce::
2670
2671   1)               |             __might_sleep() {
2672   1)               |                /* I'm a comment! */
2673   1)   1.449 us    |             }
2674
2675
2676You might find other useful features for this tracer in the
2677following "dynamic ftrace" section such as tracing only specific
2678functions or tasks.
2679
2680dynamic ftrace
2681--------------
2682
2683If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2684virtually no overhead when function tracing is disabled. The way
2685this works is the mcount function call (placed at the start of
2686every kernel function, produced by the -pg switch in gcc),
2687starts of pointing to a simple return. (Enabling FTRACE will
2688include the -pg switch in the compiling of the kernel.)
2689
2690At compile time every C file object is run through the
2691recordmcount program (located in the scripts directory). This
2692program will parse the ELF headers in the C object to find all
2693the locations in the .text section that call mcount. Starting
2694with gcc version 4.6, the -mfentry has been added for x86, which
2695calls "__fentry__" instead of "mcount". Which is called before
2696the creation of the stack frame.
2697
2698Note, not all sections are traced. They may be prevented by either
2699a notrace, or blocked another way and all inline functions are not
2700traced. Check the "available_filter_functions" file to see what functions
2701can be traced.
2702
2703A section called "__mcount_loc" is created that holds
2704references to all the mcount/fentry call sites in the .text section.
2705The recordmcount program re-links this section back into the
2706original object. The final linking stage of the kernel will add all these
2707references into a single table.
2708
2709On boot up, before SMP is initialized, the dynamic ftrace code
2710scans this table and updates all the locations into nops. It
2711also records the locations, which are added to the
2712available_filter_functions list.  Modules are processed as they
2713are loaded and before they are executed.  When a module is
2714unloaded, it also removes its functions from the ftrace function
2715list. This is automatic in the module unload code, and the
2716module author does not need to worry about it.
2717
2718When tracing is enabled, the process of modifying the function
2719tracepoints is dependent on architecture. The old method is to use
2720kstop_machine to prevent races with the CPUs executing code being
2721modified (which can cause the CPU to do undesirable things, especially
2722if the modified code crosses cache (or page) boundaries), and the nops are
2723patched back to calls. But this time, they do not call mcount
2724(which is just a function stub). They now call into the ftrace
2725infrastructure.
2726
2727The new method of modifying the function tracepoints is to place
2728a breakpoint at the location to be modified, sync all CPUs, modify
2729the rest of the instruction not covered by the breakpoint. Sync
2730all CPUs again, and then remove the breakpoint with the finished
2731version to the ftrace call site.
2732
2733Some archs do not even need to monkey around with the synchronization,
2734and can just slap the new code on top of the old without any
2735problems with other CPUs executing it at the same time.
2736
2737One special side-effect to the recording of the functions being
2738traced is that we can now selectively choose which functions we
2739wish to trace and which ones we want the mcount calls to remain
2740as nops.
2741
2742Two files are used, one for enabling and one for disabling the
2743tracing of specified functions. They are:
2744
2745  set_ftrace_filter
2746
2747and
2748
2749  set_ftrace_notrace
2750
2751A list of available functions that you can add to these files is
2752listed in:
2753
2754   available_filter_functions
2755
2756::
2757
2758  # cat available_filter_functions
2759  put_prev_task_idle
2760  kmem_cache_create
2761  pick_next_task_rt
2762  get_online_cpus
2763  pick_next_task_fair
2764  mutex_lock
2765  [...]
2766
2767If I am only interested in sys_nanosleep and hrtimer_interrupt::
2768
2769  # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2770  # echo function > current_tracer
2771  # echo 1 > tracing_on
2772  # usleep 1
2773  # echo 0 > tracing_on
2774  # cat trace
2775  # tracer: function
2776  #
2777  # entries-in-buffer/entries-written: 5/5   #P:4
2778  #
2779  #                              _-----=> irqs-off
2780  #                             / _----=> need-resched
2781  #                            | / _---=> hardirq/softirq
2782  #                            || / _--=> preempt-depth
2783  #                            ||| /     delay
2784  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2785  #              | |       |   ||||       |         |
2786            usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
2787            <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2788            usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2789            <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2790            <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2791
2792To see which functions are being traced, you can cat the file:
2793::
2794
2795  # cat set_ftrace_filter
2796  hrtimer_interrupt
2797  sys_nanosleep
2798
2799
2800Perhaps this is not enough. The filters also allow glob(7) matching.
2801
2802  ``<match>*``
2803	will match functions that begin with <match>
2804  ``*<match>``
2805	will match functions that end with <match>
2806  ``*<match>*``
2807	will match functions that have <match> in it
2808  ``<match1>*<match2>``
2809	will match functions that begin with <match1> and end with <match2>
2810
2811.. note::
2812      It is better to use quotes to enclose the wild cards,
2813      otherwise the shell may expand the parameters into names
2814      of files in the local directory.
2815
2816::
2817
2818  # echo 'hrtimer_*' > set_ftrace_filter
2819
2820Produces::
2821
2822  # tracer: function
2823  #
2824  # entries-in-buffer/entries-written: 897/897   #P:4
2825  #
2826  #                              _-----=> irqs-off
2827  #                             / _----=> need-resched
2828  #                            | / _---=> hardirq/softirq
2829  #                            || / _--=> preempt-depth
2830  #                            ||| /     delay
2831  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2832  #              | |       |   ||||       |         |
2833            <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2834            <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2835            <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2836            <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2837            <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2838            <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2839            <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2840            <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
2841
2842Notice that we lost the sys_nanosleep.
2843::
2844
2845  # cat set_ftrace_filter
2846  hrtimer_run_queues
2847  hrtimer_run_pending
2848  hrtimer_init
2849  hrtimer_cancel
2850  hrtimer_try_to_cancel
2851  hrtimer_forward
2852  hrtimer_start
2853  hrtimer_reprogram
2854  hrtimer_force_reprogram
2855  hrtimer_get_next_event
2856  hrtimer_interrupt
2857  hrtimer_nanosleep
2858  hrtimer_wakeup
2859  hrtimer_get_remaining
2860  hrtimer_get_res
2861  hrtimer_init_sleeper
2862
2863
2864This is because the '>' and '>>' act just like they do in bash.
2865To rewrite the filters, use '>'
2866To append to the filters, use '>>'
2867
2868To clear out a filter so that all functions will be recorded
2869again::
2870
2871 # echo > set_ftrace_filter
2872 # cat set_ftrace_filter
2873 #
2874
2875Again, now we want to append.
2876
2877::
2878
2879  # echo sys_nanosleep > set_ftrace_filter
2880  # cat set_ftrace_filter
2881  sys_nanosleep
2882  # echo 'hrtimer_*' >> set_ftrace_filter
2883  # cat set_ftrace_filter
2884  hrtimer_run_queues
2885  hrtimer_run_pending
2886  hrtimer_init
2887  hrtimer_cancel
2888  hrtimer_try_to_cancel
2889  hrtimer_forward
2890  hrtimer_start
2891  hrtimer_reprogram
2892  hrtimer_force_reprogram
2893  hrtimer_get_next_event
2894  hrtimer_interrupt
2895  sys_nanosleep
2896  hrtimer_nanosleep
2897  hrtimer_wakeup
2898  hrtimer_get_remaining
2899  hrtimer_get_res
2900  hrtimer_init_sleeper
2901
2902
2903The set_ftrace_notrace prevents those functions from being
2904traced.
2905::
2906
2907  # echo '*preempt*' '*lock*' > set_ftrace_notrace
2908
2909Produces::
2910
2911  # tracer: function
2912  #
2913  # entries-in-buffer/entries-written: 39608/39608   #P:4
2914  #
2915  #                              _-----=> irqs-off
2916  #                             / _----=> need-resched
2917  #                            | / _---=> hardirq/softirq
2918  #                            || / _--=> preempt-depth
2919  #                            ||| /     delay
2920  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2921  #              | |       |   ||||       |         |
2922              bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
2923              bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
2924              bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
2925              bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
2926              bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
2927              bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
2928              bash-1994  [000] ....  4342.324899: do_truncate <-do_last
2929              bash-1994  [000] ....  4342.324899: setattr_should_drop_suidgid <-do_truncate
2930              bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
2931              bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
2932              bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
2933              bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
2934
2935We can see that there's no more lock or preempt tracing.
2936
2937Selecting function filters via index
2938------------------------------------
2939
2940Because processing of strings is expensive (the address of the function
2941needs to be looked up before comparing to the string being passed in),
2942an index can be used as well to enable functions. This is useful in the
2943case of setting thousands of specific functions at a time. By passing
2944in a list of numbers, no string processing will occur. Instead, the function
2945at the specific location in the internal array (which corresponds to the
2946functions in the "available_filter_functions" file), is selected.
2947
2948::
2949
2950  # echo 1 > set_ftrace_filter
2951
2952Will select the first function listed in "available_filter_functions"
2953
2954::
2955
2956  # head -1 available_filter_functions
2957  trace_initcall_finish_cb
2958
2959  # cat set_ftrace_filter
2960  trace_initcall_finish_cb
2961
2962  # head -50 available_filter_functions | tail -1
2963  x86_pmu_commit_txn
2964
2965  # echo 1 50 > set_ftrace_filter
2966  # cat set_ftrace_filter
2967  trace_initcall_finish_cb
2968  x86_pmu_commit_txn
2969
2970Dynamic ftrace with the function graph tracer
2971---------------------------------------------
2972
2973Although what has been explained above concerns both the
2974function tracer and the function-graph-tracer, there are some
2975special features only available in the function-graph tracer.
2976
2977If you want to trace only one function and all of its children,
2978you just have to echo its name into set_graph_function::
2979
2980 echo __do_fault > set_graph_function
2981
2982will produce the following "expanded" trace of the __do_fault()
2983function::
2984
2985   0)               |  __do_fault() {
2986   0)               |    filemap_fault() {
2987   0)               |      find_lock_page() {
2988   0)   0.804 us    |        find_get_page();
2989   0)               |        __might_sleep() {
2990   0)   1.329 us    |        }
2991   0)   3.904 us    |      }
2992   0)   4.979 us    |    }
2993   0)   0.653 us    |    _spin_lock();
2994   0)   0.578 us    |    page_add_file_rmap();
2995   0)   0.525 us    |    native_set_pte_at();
2996   0)   0.585 us    |    _spin_unlock();
2997   0)               |    unlock_page() {
2998   0)   0.541 us    |      page_waitqueue();
2999   0)   0.639 us    |      __wake_up_bit();
3000   0)   2.786 us    |    }
3001   0) + 14.237 us   |  }
3002   0)               |  __do_fault() {
3003   0)               |    filemap_fault() {
3004   0)               |      find_lock_page() {
3005   0)   0.698 us    |        find_get_page();
3006   0)               |        __might_sleep() {
3007   0)   1.412 us    |        }
3008   0)   3.950 us    |      }
3009   0)   5.098 us    |    }
3010   0)   0.631 us    |    _spin_lock();
3011   0)   0.571 us    |    page_add_file_rmap();
3012   0)   0.526 us    |    native_set_pte_at();
3013   0)   0.586 us    |    _spin_unlock();
3014   0)               |    unlock_page() {
3015   0)   0.533 us    |      page_waitqueue();
3016   0)   0.638 us    |      __wake_up_bit();
3017   0)   2.793 us    |    }
3018   0) + 14.012 us   |  }
3019
3020You can also expand several functions at once::
3021
3022 echo sys_open > set_graph_function
3023 echo sys_close >> set_graph_function
3024
3025Now if you want to go back to trace all functions you can clear
3026this special filter via::
3027
3028 echo > set_graph_function
3029
3030
3031ftrace_enabled
3032--------------
3033
3034Note, the proc sysctl ftrace_enable is a big on/off switch for the
3035function tracer. By default it is enabled (when function tracing is
3036enabled in the kernel). If it is disabled, all function tracing is
3037disabled. This includes not only the function tracers for ftrace, but
3038also for any other uses (perf, kprobes, stack tracing, profiling, etc). It
3039cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set
3040registered.
3041
3042Please disable this with care.
3043
3044This can be disable (and enabled) with::
3045
3046  sysctl kernel.ftrace_enabled=0
3047  sysctl kernel.ftrace_enabled=1
3048
3049 or
3050
3051  echo 0 > /proc/sys/kernel/ftrace_enabled
3052  echo 1 > /proc/sys/kernel/ftrace_enabled
3053
3054
3055Filter commands
3056---------------
3057
3058A few commands are supported by the set_ftrace_filter interface.
3059Trace commands have the following format::
3060
3061  <function>:<command>:<parameter>
3062
3063The following commands are supported:
3064
3065- mod:
3066  This command enables function filtering per module. The
3067  parameter defines the module. For example, if only the write*
3068  functions in the ext3 module are desired, run:
3069
3070   echo 'write*:mod:ext3' > set_ftrace_filter
3071
3072  This command interacts with the filter in the same way as
3073  filtering based on function names. Thus, adding more functions
3074  in a different module is accomplished by appending (>>) to the
3075  filter file. Remove specific module functions by prepending
3076  '!'::
3077
3078   echo '!writeback*:mod:ext3' >> set_ftrace_filter
3079
3080  Mod command supports module globbing. Disable tracing for all
3081  functions except a specific module::
3082
3083   echo '!*:mod:!ext3' >> set_ftrace_filter
3084
3085  Disable tracing for all modules, but still trace kernel::
3086
3087   echo '!*:mod:*' >> set_ftrace_filter
3088
3089  Enable filter only for kernel::
3090
3091   echo '*write*:mod:!*' >> set_ftrace_filter
3092
3093  Enable filter for module globbing::
3094
3095   echo '*write*:mod:*snd*' >> set_ftrace_filter
3096
3097- traceon/traceoff:
3098  These commands turn tracing on and off when the specified
3099  functions are hit. The parameter determines how many times the
3100  tracing system is turned on and off. If unspecified, there is
3101  no limit. For example, to disable tracing when a schedule bug
3102  is hit the first 5 times, run::
3103
3104   echo '__schedule_bug:traceoff:5' > set_ftrace_filter
3105
3106  To always disable tracing when __schedule_bug is hit::
3107
3108   echo '__schedule_bug:traceoff' > set_ftrace_filter
3109
3110  These commands are cumulative whether or not they are appended
3111  to set_ftrace_filter. To remove a command, prepend it by '!'
3112  and drop the parameter::
3113
3114   echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
3115
3116  The above removes the traceoff command for __schedule_bug
3117  that have a counter. To remove commands without counters::
3118
3119   echo '!__schedule_bug:traceoff' > set_ftrace_filter
3120
3121- snapshot:
3122  Will cause a snapshot to be triggered when the function is hit.
3123  ::
3124
3125   echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
3126
3127  To only snapshot once:
3128  ::
3129
3130   echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
3131
3132  To remove the above commands::
3133
3134   echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
3135   echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
3136
3137- enable_event/disable_event:
3138  These commands can enable or disable a trace event. Note, because
3139  function tracing callbacks are very sensitive, when these commands
3140  are registered, the trace point is activated, but disabled in
3141  a "soft" mode. That is, the tracepoint will be called, but
3142  just will not be traced. The event tracepoint stays in this mode
3143  as long as there's a command that triggers it.
3144  ::
3145
3146   echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
3147   	 set_ftrace_filter
3148
3149  The format is::
3150
3151    <function>:enable_event:<system>:<event>[:count]
3152    <function>:disable_event:<system>:<event>[:count]
3153
3154  To remove the events commands::
3155
3156   echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
3157   	 set_ftrace_filter
3158   echo '!schedule:disable_event:sched:sched_switch' > \
3159   	 set_ftrace_filter
3160
3161- dump:
3162  When the function is hit, it will dump the contents of the ftrace
3163  ring buffer to the console. This is useful if you need to debug
3164  something, and want to dump the trace when a certain function
3165  is hit. Perhaps it's a function that is called before a triple
3166  fault happens and does not allow you to get a regular dump.
3167
3168- cpudump:
3169  When the function is hit, it will dump the contents of the ftrace
3170  ring buffer for the current CPU to the console. Unlike the "dump"
3171  command, it only prints out the contents of the ring buffer for the
3172  CPU that executed the function that triggered the dump.
3173
3174- stacktrace:
3175  When the function is hit, a stack trace is recorded.
3176
3177trace_pipe
3178----------
3179
3180The trace_pipe outputs the same content as the trace file, but
3181the effect on the tracing is different. Every read from
3182trace_pipe is consumed. This means that subsequent reads will be
3183different. The trace is live.
3184::
3185
3186  # echo function > current_tracer
3187  # cat trace_pipe > /tmp/trace.out &
3188  [1] 4153
3189  # echo 1 > tracing_on
3190  # usleep 1
3191  # echo 0 > tracing_on
3192  # cat trace
3193  # tracer: function
3194  #
3195  # entries-in-buffer/entries-written: 0/0   #P:4
3196  #
3197  #                              _-----=> irqs-off
3198  #                             / _----=> need-resched
3199  #                            | / _---=> hardirq/softirq
3200  #                            || / _--=> preempt-depth
3201  #                            ||| /     delay
3202  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3203  #              | |       |   ||||       |         |
3204
3205  #
3206  # cat /tmp/trace.out
3207             bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
3208             bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
3209             bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
3210             bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
3211             bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
3212             bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
3213             bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
3214             bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
3215             bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
3216
3217
3218Note, reading the trace_pipe file will block until more input is
3219added. This is contrary to the trace file. If any process opened
3220the trace file for reading, it will actually disable tracing and
3221prevent new entries from being added. The trace_pipe file does
3222not have this limitation.
3223
3224trace entries
3225-------------
3226
3227Having too much or not enough data can be troublesome in
3228diagnosing an issue in the kernel. The file buffer_size_kb is
3229used to modify the size of the internal trace buffers. The
3230number listed is the number of entries that can be recorded per
3231CPU. To know the full size, multiply the number of possible CPUs
3232with the number of entries.
3233::
3234
3235  # cat buffer_size_kb
3236  1408 (units kilobytes)
3237
3238Or simply read buffer_total_size_kb
3239::
3240
3241  # cat buffer_total_size_kb
3242  5632
3243
3244To modify the buffer, simple echo in a number (in 1024 byte segments).
3245::
3246
3247  # echo 10000 > buffer_size_kb
3248  # cat buffer_size_kb
3249  10000 (units kilobytes)
3250
3251It will try to allocate as much as possible. If you allocate too
3252much, it can cause Out-Of-Memory to trigger.
3253::
3254
3255  # echo 1000000000000 > buffer_size_kb
3256  -bash: echo: write error: Cannot allocate memory
3257  # cat buffer_size_kb
3258  85
3259
3260The per_cpu buffers can be changed individually as well:
3261::
3262
3263  # echo 10000 > per_cpu/cpu0/buffer_size_kb
3264  # echo 100 > per_cpu/cpu1/buffer_size_kb
3265
3266When the per_cpu buffers are not the same, the buffer_size_kb
3267at the top level will just show an X
3268::
3269
3270  # cat buffer_size_kb
3271  X
3272
3273This is where the buffer_total_size_kb is useful:
3274::
3275
3276  # cat buffer_total_size_kb
3277  12916
3278
3279Writing to the top level buffer_size_kb will reset all the buffers
3280to be the same again.
3281
3282Snapshot
3283--------
3284CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
3285available to all non latency tracers. (Latency tracers which
3286record max latency, such as "irqsoff" or "wakeup", can't use
3287this feature, since those are already using the snapshot
3288mechanism internally.)
3289
3290Snapshot preserves a current trace buffer at a particular point
3291in time without stopping tracing. Ftrace swaps the current
3292buffer with a spare buffer, and tracing continues in the new
3293current (=previous spare) buffer.
3294
3295The following tracefs files in "tracing" are related to this
3296feature:
3297
3298  snapshot:
3299
3300	This is used to take a snapshot and to read the output
3301	of the snapshot. Echo 1 into this file to allocate a
3302	spare buffer and to take a snapshot (swap), then read
3303	the snapshot from this file in the same format as
3304	"trace" (described above in the section "The File
3305	System"). Both reads snapshot and tracing are executable
3306	in parallel. When the spare buffer is allocated, echoing
3307	0 frees it, and echoing else (positive) values clear the
3308	snapshot contents.
3309	More details are shown in the table below.
3310
3311	+--------------+------------+------------+------------+
3312	|status\\input |     0      |     1      |    else    |
3313	+==============+============+============+============+
3314	|not allocated |(do nothing)| alloc+swap |(do nothing)|
3315	+--------------+------------+------------+------------+
3316	|allocated     |    free    |    swap    |   clear    |
3317	+--------------+------------+------------+------------+
3318
3319Here is an example of using the snapshot feature.
3320::
3321
3322  # echo 1 > events/sched/enable
3323  # echo 1 > snapshot
3324  # cat snapshot
3325  # tracer: nop
3326  #
3327  # entries-in-buffer/entries-written: 71/71   #P:8
3328  #
3329  #                              _-----=> irqs-off
3330  #                             / _----=> need-resched
3331  #                            | / _---=> hardirq/softirq
3332  #                            || / _--=> preempt-depth
3333  #                            ||| /     delay
3334  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3335  #              | |       |   ||||       |         |
3336            <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120   prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3337             sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120   prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3338  [...]
3339          <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3340
3341  # cat trace
3342  # tracer: nop
3343  #
3344  # entries-in-buffer/entries-written: 77/77   #P:8
3345  #
3346  #                              _-----=> irqs-off
3347  #                             / _----=> need-resched
3348  #                            | / _---=> hardirq/softirq
3349  #                            || / _--=> preempt-depth
3350  #                            ||| /     delay
3351  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3352  #              | |       |   ||||       |         |
3353            <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3354   snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3355  [...]
3356
3357
3358If you try to use this snapshot feature when current tracer is
3359one of the latency tracers, you will get the following results.
3360::
3361
3362  # echo wakeup > current_tracer
3363  # echo 1 > snapshot
3364  bash: echo: write error: Device or resource busy
3365  # cat snapshot
3366  cat: snapshot: Device or resource busy
3367
3368
3369Instances
3370---------
3371In the tracefs tracing directory is a directory called "instances".
3372This directory can have new directories created inside of it using
3373mkdir, and removing directories with rmdir. The directory created
3374with mkdir in this directory will already contain files and other
3375directories after it is created.
3376::
3377
3378  # mkdir instances/foo
3379  # ls instances/foo
3380  buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
3381  set_event  snapshot  trace  trace_clock  trace_marker  trace_options
3382  trace_pipe  tracing_on
3383
3384As you can see, the new directory looks similar to the tracing directory
3385itself. In fact, it is very similar, except that the buffer and
3386events are agnostic from the main directory, or from any other
3387instances that are created.
3388
3389The files in the new directory work just like the files with the
3390same name in the tracing directory except the buffer that is used
3391is a separate and new buffer. The files affect that buffer but do not
3392affect the main buffer with the exception of trace_options. Currently,
3393the trace_options affect all instances and the top level buffer
3394the same, but this may change in future releases. That is, options
3395may become specific to the instance they reside in.
3396
3397Notice that none of the function tracer files are there, nor is
3398current_tracer and available_tracers. This is because the buffers
3399can currently only have events enabled for them.
3400::
3401
3402  # mkdir instances/foo
3403  # mkdir instances/bar
3404  # mkdir instances/zoot
3405  # echo 100000 > buffer_size_kb
3406  # echo 1000 > instances/foo/buffer_size_kb
3407  # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
3408  # echo function > current_trace
3409  # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3410  # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3411  # echo 1 > instances/foo/events/sched/sched_switch/enable
3412  # echo 1 > instances/bar/events/irq/enable
3413  # echo 1 > instances/zoot/events/syscalls/enable
3414  # cat trace_pipe
3415  CPU:2 [LOST 11745 EVENTS]
3416              bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3417              bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3418              bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3419              bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3420              bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3421              bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3422              bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3423              bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3424              bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3425              bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3426              bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3427  [...]
3428
3429  # cat instances/foo/trace_pipe
3430              bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3431              bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3432            <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3433            <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3434       rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3435              bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3436              bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3437              bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3438       kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3439       kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3440  [...]
3441
3442  # cat instances/bar/trace_pipe
3443       migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
3444            <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
3445              bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
3446              bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
3447              bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
3448              bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
3449              bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
3450              bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
3451              sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3452              sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
3453              sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
3454              sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
3455  [...]
3456
3457  # cat instances/zoot/trace
3458  # tracer: nop
3459  #
3460  # entries-in-buffer/entries-written: 18996/18996   #P:4
3461  #
3462  #                              _-----=> irqs-off
3463  #                             / _----=> need-resched
3464  #                            | / _---=> hardirq/softirq
3465  #                            || / _--=> preempt-depth
3466  #                            ||| /     delay
3467  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3468  #              | |       |   ||||       |         |
3469              bash-1998  [000] d...   140.733501: sys_write -> 0x2
3470              bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
3471              bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
3472              bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3473              bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
3474              bash-1998  [000] d...   140.733510: sys_close(fd: a)
3475              bash-1998  [000] d...   140.733510: sys_close -> 0x0
3476              bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3477              bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
3478              bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3479              bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
3480
3481You can see that the trace of the top most trace buffer shows only
3482the function tracing. The foo instance displays wakeups and task
3483switches.
3484
3485To remove the instances, simply delete their directories:
3486::
3487
3488  # rmdir instances/foo
3489  # rmdir instances/bar
3490  # rmdir instances/zoot
3491
3492Note, if a process has a trace file open in one of the instance
3493directories, the rmdir will fail with EBUSY.
3494
3495
3496Stack trace
3497-----------
3498Since the kernel has a fixed sized stack, it is important not to
3499waste it in functions. A kernel developer must be conscience of
3500what they allocate on the stack. If they add too much, the system
3501can be in danger of a stack overflow, and corruption will occur,
3502usually leading to a system panic.
3503
3504There are some tools that check this, usually with interrupts
3505periodically checking usage. But if you can perform a check
3506at every function call that will become very useful. As ftrace provides
3507a function tracer, it makes it convenient to check the stack size
3508at every function call. This is enabled via the stack tracer.
3509
3510CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3511To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3512::
3513
3514 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3515
3516You can also enable it from the kernel command line to trace
3517the stack size of the kernel during boot up, by adding "stacktrace"
3518to the kernel command line parameter.
3519
3520After running it for a few minutes, the output looks like:
3521::
3522
3523  # cat stack_max_size
3524  2928
3525
3526  # cat stack_trace
3527          Depth    Size   Location    (18 entries)
3528          -----    ----   --------
3529    0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
3530    1)     2704     160   find_busiest_group+0x31/0x1f1
3531    2)     2544     256   load_balance+0xd9/0x662
3532    3)     2288      80   idle_balance+0xbb/0x130
3533    4)     2208     128   __schedule+0x26e/0x5b9
3534    5)     2080      16   schedule+0x64/0x66
3535    6)     2064     128   schedule_timeout+0x34/0xe0
3536    7)     1936     112   wait_for_common+0x97/0xf1
3537    8)     1824      16   wait_for_completion+0x1d/0x1f
3538    9)     1808     128   flush_work+0xfe/0x119
3539   10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
3540   11)     1664      48   input_available_p+0x1d/0x5c
3541   12)     1616      48   n_tty_poll+0x6d/0x134
3542   13)     1568      64   tty_poll+0x64/0x7f
3543   14)     1504     880   do_select+0x31e/0x511
3544   15)      624     400   core_sys_select+0x177/0x216
3545   16)      224      96   sys_select+0x91/0xb9
3546   17)      128     128   system_call_fastpath+0x16/0x1b
3547
3548Note, if -mfentry is being used by gcc, functions get traced before
3549they set up the stack frame. This means that leaf level functions
3550are not tested by the stack tracer when -mfentry is used.
3551
3552Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3553
3554More
3555----
3556More details can be found in the source code, in the `kernel/trace/*.c` files.
3557