========================
ftrace - Function Tracer
========================

Copyright 2008 Red Hat Inc.

:Author:   Steven Rostedt <srostedt@redhat.com>
:License:  The GNU Free Documentation License, Version 1.2
           (dual licensed under the GPL v2)
:Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
                      John Kacur, and David Teigland.

- Written for: 2.6.28-rc2
- Updated for: 3.10
- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
- Converted to rst format - Changbin Du <changbin.du@intel.com>

Introduction
------------

Ftrace is an internal tracer designed to help developers and
designers of systems find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and
performance issues that take place outside of user-space.

Although ftrace is typically considered the function tracer, it
is really a framework of several assorted tracing utilities.
There's latency tracing to examine what occurs between the time
interrupts are disabled and enabled, as well as for preemption,
and from the time a task is woken to the time it is actually
scheduled in.

One of the most common uses of ftrace is event tracing.
Throughout the kernel are hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.

See events.rst for more information.


Implementation Details
----------------------

See :doc:`ftrace-design` for details for arch porters and such.


The File System
---------------

Ftrace uses the tracefs file system to hold the control files as
well as the files to display output.

When tracefs is configured into the kernel (which selecting any ftrace
option will do) the directory /sys/kernel/tracing will be created.
To mount this directory, you can add to your /etc/fstab file::

	tracefs       /sys/kernel/tracing       tracefs defaults        0       0

Or you can mount it at run time with::

	mount -t tracefs nodev /sys/kernel/tracing

For quicker access to that directory you may want to make a soft link to
it::

	ln -s /sys/kernel/tracing /tracing

.. attention::

  Before 4.1, all ftrace tracing control files were within the debugfs
  file system, which is typically located at /sys/kernel/debug/tracing.
  For backward compatibility, when mounting the debugfs file system,
  the tracefs file system will be automatically mounted at:

  /sys/kernel/debug/tracing

  All files located in the tracefs file system will be located in that
  debugfs file system directory as well.

  In order to not automount tracefs in the debugfs filesystem, enable the
  defconfig option CONFIG_TRACEFS_DISABLE_AUTOMOUNT.

.. attention::

  Any selected ftrace option will also create the tracefs file system.
  The rest of the document will assume that you are in the ftrace directory
  (cd /sys/kernel/tracing) and will only concentrate on the files within that
  directory and not distract from the content with the extended
  "/sys/kernel/tracing" path name.

That's it! (assuming that you have ftrace configured into your kernel)

After mounting tracefs you will have access to the control and output files
of ftrace. Here is a list of some of the key files:


 Note: all time values are in microseconds.

  current_tracer:

	This is used to set or display the current tracer
	that is configured. Changing the current tracer clears
	the ring buffer content as well as the "snapshot" buffer.

  available_tracers:

	This holds the different types of tracers that
	have been compiled into the kernel. The
	tracers listed here can be configured by
	echoing their name into current_tracer.
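As a minimal sketch of the two files above, the following lists the compiled-in tracers and selects one by name. It assumes root privileges and a tracefs mounted at /sys/kernel/tracing; the guard turns it into a no-op elsewhere:

```shell
# Select the function tracer if this kernel provides it.
TRACEFS=/sys/kernel/tracing

if [ -r "$TRACEFS/available_tracers" ]; then
    cat "$TRACEFS/available_tracers"       # e.g. "function_graph function nop"
    if grep -qw function "$TRACEFS/available_tracers"; then
        echo function > "$TRACEFS/current_tracer"
        cat "$TRACEFS/current_tracer"      # now reads back "function"
    fi
else
    echo "tracefs not mounted; try: mount -t tracefs nodev $TRACEFS"
fi
```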
  tracing_on:

	This sets or displays whether writing to the trace
	ring buffer is enabled. Echo 0 into this file to disable
	the tracer or 1 to enable it. Note, this only disables
	writing to the ring buffer; the tracing overhead may
	still be occurring.

	The kernel function tracing_off() can be used within the
	kernel to disable writing to the ring buffer, which will
	set this file to "0". User space can re-enable tracing by
	echoing "1" into the file.

	Note, the function and event trigger "traceoff" will also
	set this file to zero and stop tracing, which can also
	be re-enabled by user space using this file.

  trace:

	This file holds the output of the trace in a human
	readable format (described below). Opening this file for
	writing with the O_TRUNC flag clears the ring buffer content.
	Note, this file is not a consumer. If tracing is off
	(no tracer running, or tracing_on is zero), it will produce
	the same output each time it is read. When tracing is on,
	it may produce inconsistent results as it tries to read
	the entire buffer without consuming it.

  trace_pipe:

	The output is the same as the "trace" file but this
	file is meant to be streamed with live tracing.
	Reads from this file will block until new data is
	retrieved. Unlike the "trace" file, this file is a
	consumer. This means reading from this file causes
	sequential reads to display more current data. Once
	data is read from this file, it is consumed, and
	will not be read again with a sequential read. The
	"trace" file is static, and if the tracer is not
	adding more data, it will display the same
	information every time it is read.

  trace_options:

	This file lets the user control the amount of data
	that is displayed in one of the above output
	files. Options also exist to modify how a tracer
	or events work (stack traces, timestamps, etc).
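The difference between the static "trace" file and the consuming "trace_pipe" file can be sketched as follows (assuming root and a mounted tracefs; `timeout` bounds the blocking read):

```shell
# Peek at the buffer without consuming it, then stream it live.
TRACEFS=/sys/kernel/tracing

if [ -r "$TRACEFS/trace" ]; then
    head -n 20 "$TRACEFS/trace"            # non-consuming snapshot; rereads repeat
    # Stream events for 2 seconds; each line read here is consumed.
    timeout 2 cat "$TRACEFS/trace_pipe" || true
else
    echo "tracefs not available"
fi
```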
  options:

	This is a directory that has a file for every available
	trace option (also in trace_options). Options may also be set
	or cleared by writing a "1" or "0" respectively into the
	corresponding file with the option name.

  tracing_max_latency:

	Some of the tracers record the max latency.
	For example, the maximum time that interrupts are disabled.
	The maximum time is saved in this file. The max trace will also be
	stored, and displayed by "trace". A new max trace will only be
	recorded if the latency is greater than the value in this file
	(in microseconds).

	By echoing a time into this file, no latency will be recorded
	unless it is greater than the time in this file.

  tracing_thresh:

	Some latency tracers will record a trace whenever the
	latency is greater than the number in this file.
	Only active when the file contains a number greater than 0.
	(in microseconds)

  buffer_size_kb:

	This sets or displays the number of kilobytes each CPU
	buffer holds. By default, the trace buffers are the same size
	for each CPU. The displayed number is the size of the
	CPU buffer and not the total size of all buffers. The
	trace buffers are allocated in pages (blocks of memory
	that the kernel uses for allocation, usually 4 KB in size).
	A few extra pages may be allocated to accommodate buffer management
	meta-data. If the last page allocated has room for more bytes
	than requested, the rest of the page will be used,
	making the actual allocation bigger than requested or shown.
	( Note, the size may not be a multiple of the page size
	due to buffer management meta-data. )

	Buffer sizes for individual CPUs may vary
	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
	this file will show "X".

  buffer_total_size_kb:

	This displays the total combined size of all the trace buffers.
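A short sketch of buffer sizing (root and a mounted tracefs assumed; 4096 KB is an arbitrary example size):

```shell
# Inspect and resize the per-CPU ring buffers.
TRACEFS=/sys/kernel/tracing

if [ -w "$TRACEFS/buffer_size_kb" ]; then
    cat "$TRACEFS/buffer_size_kb"           # per-CPU size, or "X" if CPUs differ
    cat "$TRACEFS/buffer_total_size_kb"     # combined size of all CPU buffers
    echo 4096 > "$TRACEFS/buffer_size_kb"   # give each CPU roughly a 4 MB buffer
else
    echo "need root and a mounted tracefs to resize buffers"
fi
```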
  free_buffer:

	If a process is performing tracing, and the ring buffer should be
	shrunk "freed" when the process is finished, even if it were to be
	killed by a signal, this file can be used for that purpose. On close
	of this file, the ring buffer will be resized to its minimum size.
	Having a process that is tracing also open this file, when the process
	exits, its file descriptor for this file will be closed, and in doing
	so, the ring buffer will be "freed".

	It may also stop tracing if the disable_on_free option is set.

  tracing_cpumask:

	This is a mask that lets the user only trace on specified CPUs.
	The format is a hex string representing the CPUs.

  set_ftrace_filter:

	When dynamic ftrace is configured in (see the
	section below "dynamic ftrace"), the code is dynamically
	modified (code text rewrite) to disable calling of the
	function profiler (mcount). This lets tracing be configured
	in with practically no overhead in performance. This also
	has a side effect of enabling or disabling specific functions
	to be traced. Echoing names of functions into this file
	will limit the trace to only those functions.
	This influences the tracers "function" and "function_graph"
	and thus also function profiling (see "function_profile_enabled").

	The functions listed in "available_filter_functions" are what
	can be written into this file.

	This interface also allows for commands to be used. See the
	"Filter commands" section for more details.

	As a speed up, since processing strings can be quite expensive
	and requires a check of all functions registered to tracing,
	an index can be written into this file instead. A number
	(starting with "1") written into this file will select the
	function at the corresponding line position of the
	"available_filter_functions" file.
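The filter interface above accepts exact names, glob patterns, and line indexes, and appending versus truncating the file matters. A hedged sketch (root and a mounted tracefs assumed; the function names are examples, not guaranteed to exist on every kernel):

```shell
# Limit function tracing to a few functions.
TRACEFS=/sys/kernel/tracing

if [ -w "$TRACEFS/set_ftrace_filter" ]; then
    echo 'schedule'  > "$TRACEFS/set_ftrace_filter"   # ">" resets the filter
    echo 'vfs_*'    >> "$TRACEFS/set_ftrace_filter"   # ">>" appends a glob pattern
    echo 1          >> "$TRACEFS/set_ftrace_filter"   # line 1 of available_filter_functions
    cat "$TRACEFS/set_ftrace_filter"
    echo > "$TRACEFS/set_ftrace_filter"               # clear: trace everything again
else
    echo "need root and a mounted tracefs"
fi
```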
  set_ftrace_notrace:

	This has an effect opposite to that of
	set_ftrace_filter. Any function that is added here will not
	be traced. If a function exists in both set_ftrace_filter
	and set_ftrace_notrace, the function will _not_ be traced.

  set_ftrace_pid:

	Have the function tracer only trace the threads whose PID is
	listed in this file.

	If the "function-fork" option is set, then when a task whose
	PID is listed in this file forks, the child's PID will
	automatically be added to this file, and the child will be
	traced by the function tracer as well. This option will also
	cause PIDs of tasks that exit to be removed from the file.

  set_ftrace_notrace_pid:

	Have the function tracer ignore threads whose PID is listed in
	this file.

	If the "function-fork" option is set, then when a task whose
	PID is listed in this file forks, the child's PID will
	automatically be added to this file, and the child will not be
	traced by the function tracer either. This option will also
	cause PIDs of tasks that exit to be removed from the file.

	If a PID is in both this file and "set_ftrace_pid", then this
	file takes precedence, and the thread will not be traced.

  set_event_pid:

	Have the events only trace a task with a PID listed in this file.
	Note, sched_switch and sched_wake_up will also trace events
	listed in this file.

	To have the PIDs of children of tasks with their PID in this file
	added on fork, enable the "event-fork" option. That option will also
	cause the PIDs of tasks to be removed from this file when the task
	exits.

  set_event_notrace_pid:

	Have the events not trace a task with a PID listed in this file.
	Note, sched_switch and sched_wakeup will trace threads not listed
	in this file, even if a thread's PID is in the file, if the
	sched_switch or sched_wakeup events also trace a thread that should
	be traced.

	To have the PIDs of children of tasks with their PID in this file
	added on fork, enable the "event-fork" option. That option will also
	cause the PIDs of tasks to be removed from this file when the task
	exits.

  set_graph_function:

	Functions listed in this file will cause the function graph
	tracer to only trace these functions and the functions that
	they call. (See the section "dynamic ftrace" for more details.)
	Note, set_ftrace_filter and set_ftrace_notrace still affect
	what functions are being traced.

  set_graph_notrace:

	Similar to set_graph_function, but will disable function graph
	tracing when the function is hit until it exits the function.
	This makes it possible to ignore tracing functions that are called
	by a specific function.

  available_filter_functions:

	This lists the functions that ftrace has processed and can trace.
	These are the function names that you can pass to
	"set_ftrace_filter", "set_ftrace_notrace",
	"set_graph_function", or "set_graph_notrace".
	(See the section "dynamic ftrace" below for more details.)

  dyn_ftrace_total_info:

	This file is for debugging purposes. It shows the number of
	functions that have been converted to nops and are available
	to be traced.

  enabled_functions:

	This file is more for debugging ftrace, but can also be useful
	in seeing if any function has a callback attached to it.
	Not only does the trace infrastructure use the ftrace function
	trace utility, but other subsystems might too. This file
	displays all functions that have a callback attached to them
	as well as the number of callbacks that have been attached.
	Note, a callback may also call multiple functions which will
	not be listed in this count.

	If the callback registered to be traced by a function with
	the "save regs" attribute (thus even more overhead), an 'R'
	will be displayed on the same line as the function that
	is returning registers.

	If the callback registered to be traced by a function with
	the "ip modify" attribute (thus the regs->ip can be changed),
	an 'I' will be displayed on the same line as the function that
	can be overridden.

	If the architecture supports it, it will also show what callback
	is being directly called by the function. If the count is greater
	than 1 it most likely will be ftrace_ops_list_func().

	If the callback of the function jumps to a trampoline that is
	specific to the callback and not the standard trampoline,
	its address will be printed as well as the function that the
	trampoline calls.

  function_profile_enabled:

	When set it will enable all functions with either the function
	tracer, or if configured, the function graph tracer. It will
	keep a histogram of the number of functions that were called
	and if the function graph tracer was configured, it will also keep
	track of the time spent in those functions. The histogram
	content can be displayed in the files:

	trace_stat/function<cpu> ( function0, function1, etc).

  trace_stat:

	A directory that holds different tracing stats.

  kprobe_events:

	Enable dynamic trace points. See kprobetrace.rst.

  kprobe_profile:

	Dynamic trace points stats. See kprobetrace.rst.

  max_graph_depth:

	Used with the function graph tracer. This is the max depth
	it will trace into a function. Setting this to a value of
	one will show only the first kernel function that is called
	from user space.

  printk_formats:

	This is for tools that read the raw format files.
	If an event in the ring buffer references a string, only a
	pointer to the string is recorded into the buffer and not the
	string itself. This prevents tools from knowing what that string
	was. This file displays the string and address for the string,
	allowing tools to map the pointers to what the strings were.

  saved_cmdlines:

	Only the pid of the task is recorded in a trace event unless
	the event specifically saves the task comm as well. Ftrace
	makes a cache of pid mappings to comms to try to display
	comms for events. If a pid for a comm is not listed, then
	"<...>" is displayed in the output.

	If the option "record-cmd" is set to "0", then comms of tasks
	will not be saved during recording. By default, it is enabled.

  saved_cmdlines_size:

	By default, 128 comms are saved (see "saved_cmdlines" above). To
	increase or decrease the amount of comms that are cached, echo
	the number of comms to cache into this file.

  saved_tgids:

	If the option "record-tgid" is set, on each scheduling context switch
	the Task Group ID of a task is saved in a table mapping the PID of
	the thread to its TGID. By default, the "record-tgid" option is
	disabled.

  snapshot:

	This displays the "snapshot" buffer and also lets the user
	take a snapshot of the current running trace.
	See the "Snapshot" section below for more details.

  stack_max_size:

	When the stack tracer is activated, this will display the
	maximum stack size it has encountered.
	See the "Stack Trace" section below.

  stack_trace:

	This displays the stack back trace of the largest stack
	that was encountered when the stack tracer is activated.
	See the "Stack Trace" section below.

  stack_trace_filter:

	This is similar to "set_ftrace_filter" but it limits what
	functions the stack tracer will check.
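The stack-tracer files above are read-only views; the tracer itself is switched on through a sysctl. A sketch (root assumed; the guard skips the run on kernels built without CONFIG_STACK_TRACER):

```shell
# Run the stack tracer briefly and inspect the deepest stack it saw.
TRACEFS=/sys/kernel/tracing

if [ -w /proc/sys/kernel/stack_tracer_enabled ]; then
    echo 1 > /proc/sys/kernel/stack_tracer_enabled
    sleep 1                               # let some kernel activity happen
    cat "$TRACEFS/stack_max_size"         # deepest stack seen, in bytes
    cat "$TRACEFS/stack_trace"            # the functions on that stack
    echo 0 > /proc/sys/kernel/stack_tracer_enabled
else
    echo "stack tracer not available (needs CONFIG_STACK_TRACER and root)"
fi
```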
  trace_clock:

	Whenever an event is recorded into the ring buffer, a
	"timestamp" is added. This stamp comes from a specified
	clock. By default, ftrace uses the "local" clock. This
	clock is very fast and strictly per cpu, but on some
	systems it may not be monotonic with respect to other
	CPUs. In other words, the local clocks may not be in sync
	with local clocks on other CPUs.

	Usual clocks for tracing::

	  # cat trace_clock
	  [local] global counter x86-tsc

	The clock with the square brackets around it is the one in effect.

	local:
		Default clock, but may not be in sync across CPUs

	global:
		This clock is in sync with all CPUs but may
		be a bit slower than the local clock.

	counter:
		This is not a clock at all, but literally an atomic
		counter. It counts up one by one, but is in sync
		with all CPUs. This is useful when you need to
		know exactly the order events occurred with respect to
		each other on different CPUs.

	uptime:
		This uses the jiffies counter and the time stamp
		is relative to the time since boot up.

	perf:
		This makes ftrace use the same clock that perf uses.
		Eventually perf will be able to read ftrace buffers
		and this will help out in interleaving the data.

	x86-tsc:
		Architectures may define their own clocks. For
		example, x86 uses its own TSC cycle clock here.

	ppc-tb:
		This uses the powerpc timebase register value.
		This is in sync across CPUs and can also be used
		to correlate events across hypervisor/guest if
		tb_offset is known.

	mono:
		This uses the fast monotonic clock (CLOCK_MONOTONIC)
		which is monotonic and is subject to NTP rate adjustments.

	mono_raw:
		This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
		which is monotonic but is not subject to any rate adjustments
		and ticks at the same rate as the hardware clocksource.
	boot:
		This is the boot clock (CLOCK_BOOTTIME) and is based on the
		fast monotonic clock, but also accounts for time spent in
		suspend. Since the clock access is designed for use in
		tracing in the suspend path, some side effects are possible
		if the clock is accessed after the suspend time is accounted
		for but before the fast mono clock is updated. In this case,
		the clock update appears to happen slightly sooner than it
		normally would have. Also on 32-bit systems, it's possible
		that the 64-bit boot offset sees a partial update. These
		effects are rare and post processing should be able to
		handle them. See comments in the ktime_get_boot_fast_ns()
		function for more information.

	To set a clock, simply echo the clock name into this file::

	  # echo global > trace_clock

	Setting a clock clears the ring buffer content as well as the
	"snapshot" buffer.

  trace_marker:

	This is a very useful file for synchronizing user space
	with events happening in the kernel. Strings written into
	this file will be written into the ftrace buffer.

	It is useful in applications to open this file at the start
	of the application and just reference the file descriptor
	for the file::

		void trace_write(const char *fmt, ...)
		{
			va_list ap;
			char buf[256];
			int n;

			if (trace_fd < 0)
				return;

			va_start(ap, fmt);
			n = vsnprintf(buf, 256, fmt, ap);
			va_end(ap);

			write(trace_fd, buf, n);
		}

	start::

		trace_fd = open("trace_marker", O_WRONLY);

	Note: Writing into the trace_marker file can also initiate triggers
	that are written into /sys/kernel/tracing/events/ftrace/print/trigger.
	See "Event triggers" in Documentation/trace/events.rst and an
	example in Documentation/trace/histogram.rst (Section 3.)
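The same marker mechanism can be exercised straight from the shell, which is handy for bracketing a workload so its start and end show up inline in the trace. A sketch (root and a mounted tracefs assumed; "myapp" and the `sleep` workload are placeholders):

```shell
# Bracket a workload with trace markers.
TRACEFS=/sys/kernel/tracing

if [ -w "$TRACEFS/trace_marker" ]; then
    echo "myapp: work start" > "$TRACEFS/trace_marker"
    sleep 1                                  # stand-in for the real workload
    echo "myapp: work done"  > "$TRACEFS/trace_marker"
    grep myapp "$TRACEFS/trace"              # markers appear as ftrace print events
else
    echo "need root and a mounted tracefs"
fi
```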
  trace_marker_raw:

	This is similar to trace_marker above, but is meant for binary data
	to be written to it, where a tool can be used to parse the data
	from trace_pipe_raw.

  uprobe_events:

	Add dynamic tracepoints in programs.
	See uprobetracer.rst

  uprobe_profile:

	Uprobe statistics. See uprobetrace.txt

  instances:

	This is a way to make multiple trace buffers where different
	events can be recorded in different buffers.
	See the "Instances" section below.

  events:

	This is the trace event directory. It holds event tracepoints
	(also known as static tracepoints) that have been compiled
	into the kernel. It shows what event tracepoints exist
	and how they are grouped by system. There are "enable"
	files at various levels that can enable the tracepoints
	when a "1" is written to them.

	See events.rst for more information.

  set_event:

	Echoing an event name into this file will enable that event.

	See events.rst for more information.

  available_events:

	A list of events that can be enabled in tracing.

	See events.rst for more information.

  timestamp_mode:

	Certain tracers may change the timestamp mode used when
	logging trace events into the event buffer. Events with
	different modes can coexist within a buffer but the mode in
	effect when an event is logged determines which timestamp mode
	is used for that event. The default timestamp mode is
	'delta'.

	Usual timestamp modes for tracing::

	  # cat timestamp_mode
	  [delta] absolute

	The timestamp mode with the square brackets around it is the
	one in effect.

	delta: Default timestamp mode - timestamp is a delta against
	       a per-buffer timestamp.

	absolute: The timestamp is a full timestamp, not a delta
		  against some other value. As such it takes up more
		  space and is less efficient.
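Enabling an event through set_event and through the events directory are equivalent; a sketch using the sched_switch tracepoint (root and a mounted tracefs assumed):

```shell
# Three equivalent ways to enable the sched_switch tracepoint.
TRACEFS=/sys/kernel/tracing

if [ -w "$TRACEFS/set_event" ]; then
    echo sched_switch > "$TRACEFS/set_event"               # by event name
    echo 'sched:sched_switch' > "$TRACEFS/set_event"       # system:event form
    echo 1 > "$TRACEFS/events/sched/sched_switch/enable"   # via the events directory
    echo > "$TRACEFS/set_event"                            # disable all events again
else
    echo "need root and a mounted tracefs"
fi
```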
  hwlat_detector:

	Directory for the Hardware Latency Detector.
	See the "Hardware Latency Detector" section below.

  per_cpu:

	This is a directory that contains the trace per_cpu information.

  per_cpu/cpu0/buffer_size_kb:

	The ftrace buffer is defined per_cpu. That is, there's a separate
	buffer for each CPU to allow writes to be done atomically,
	and free from cache bouncing. These buffers may have different
	sizes. This file is similar to the buffer_size_kb
	file, but it only displays or sets the buffer size for the
	specific CPU (here cpu0).

  per_cpu/cpu0/trace:

	This is similar to the "trace" file, but it will only display
	the data specific to the CPU. If written to, it only clears
	the specific CPU buffer.

  per_cpu/cpu0/trace_pipe:

	This is similar to the "trace_pipe" file, and is a consuming
	read, but it will only display (and consume) the data specific
	to the CPU.

  per_cpu/cpu0/trace_pipe_raw:

	For tools that can parse the ftrace ring buffer binary format,
	the trace_pipe_raw file can be used to extract the data
	from the ring buffer directly. With the use of the splice()
	system call, the buffer data can be quickly transferred to
	a file or to the network where a server is collecting the
	data.

	Like trace_pipe, this is a consuming reader, where multiple
	reads will always produce different data.

  per_cpu/cpu0/snapshot:

	This is similar to the main "snapshot" file, but will only
	snapshot the current CPU (if supported). It only displays
	the content of the snapshot for a given CPU, and if
	written to, only clears this CPU buffer.

  per_cpu/cpu0/snapshot_raw:

	Similar to trace_pipe_raw, but will read the binary format
	from the snapshot buffer for the given CPU.
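A rough sketch of grabbing one CPU's raw buffer pages to a file. Real tools such as trace-cmd use splice() for efficiency; plain dd works for a quick grab (root and a mounted tracefs assumed, and `timeout` bounds the blocking read):

```shell
# Consuming read of cpu0's raw ring-buffer pages.
TRACEFS=/sys/kernel/tracing

if [ -r "$TRACEFS/per_cpu/cpu0/trace_pipe_raw" ]; then
    timeout 2 dd if="$TRACEFS/per_cpu/cpu0/trace_pipe_raw" \
                 of=/tmp/cpu0.raw bs=4096 || true
    ls -l /tmp/cpu0.raw        # binary data; parse with a raw-format-aware tool
else
    echo "need root and a mounted tracefs"
fi
```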
  per_cpu/cpu0/stats:

	This displays certain stats about the ring buffer:

	entries:
		The number of events that are still in the buffer.

	overrun:
		The number of lost events due to overwriting when
		the buffer was full.

	commit overrun:
		Should always be zero.
		This gets set if so many events happened within a nested
		event (ring buffer is re-entrant), that it fills the
		buffer and starts dropping events.

	bytes:
		Bytes actually read (not overwritten).

	oldest event ts:
		The oldest timestamp in the buffer.

	now ts:
		The current timestamp.

	dropped events:
		Events lost due to the overwrite option being off.

	read events:
		The number of events read.

The Tracers
-----------

Here is the list of current tracers that may be configured.

  "function"

	Function call tracer to trace all kernel functions.

  "function_graph"

	Similar to the function tracer except that the
	function tracer probes the functions on their entry
	whereas the function graph tracer traces on both entry
	and exit of the functions. It then provides the ability
	to draw a graph of function calls similar to C code
	source.

  "blk"

	The block tracer. The tracer used by the blktrace user
	application.

  "hwlat"

	The Hardware Latency tracer is used to detect if the hardware
	produces any latency. See the "Hardware Latency Detector"
	section below.

  "irqsoff"

	Traces the areas that disable interrupts and saves
	the trace with the longest max latency.
	See tracing_max_latency. When a new max is recorded,
	it replaces the old trace. It is best to view this
	trace with the latency-format option enabled, which
	happens automatically when the tracer is selected.

  "preemptoff"

	Similar to irqsoff but traces and records the amount of
	time for which preemption is disabled.
  "preemptirqsoff"

	Similar to irqsoff and preemptoff, but traces and
	records the largest time for which irqs and/or preemption
	is disabled.

  "wakeup"

	Traces and records the max latency that it takes for
	the highest priority task to get scheduled after
	it has been woken up.
	Traces all tasks as an average developer would expect.

  "wakeup_rt"

	Traces and records the max latency that it takes for just
	RT tasks (as the current "wakeup" does). This is useful
	for those interested in wake up timings of RT tasks.

  "wakeup_dl"

	Traces and records the max latency that it takes for
	a SCHED_DEADLINE task to be woken (as the "wakeup" and
	"wakeup_rt" tracers do).

  "mmiotrace"

	A special tracer that is used to trace binary modules.
	It will trace all the calls that a module makes to the
	hardware: everything it writes to and reads from the I/O
	as well.

  "branch"

	This tracer can be configured when tracing likely/unlikely
	calls within the kernel. It will trace when a likely or
	unlikely branch is hit and whether its prediction
	was correct.

  "nop"

	This is the "trace nothing" tracer. To remove all
	tracers from tracing simply echo "nop" into
	current_tracer.

Error conditions
----------------

  For most ftrace commands, failure modes are obvious and communicated
  using standard return codes.

  For other more involved commands, extended error information may be
  available via the tracing/error_log file. For the commands that
  support it, reading the tracing/error_log file after an error will
  display more detailed information about what went wrong, if
  information is available. The tracing/error_log file is a circular
  error log displaying a small number (currently, 8) of ftrace errors
  for the last (8) failed commands.
  The extended error information and usage takes the form shown in
  this example::

    # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
    echo: write error: Invalid argument

    # cat /sys/kernel/debug/tracing/error_log
    [ 5348.887237] location: error: Couldn't yyy: zzz
      Command: xxx
               ^
    [ 7517.023364] location: error: Bad rrr: sss
      Command: ppp qqq
                   ^

  To clear the error log, echo the empty string into it::

    # echo > /sys/kernel/debug/tracing/error_log

Examples of using the tracer
----------------------------

Here are typical examples of using the tracers when controlling
them only with the tracefs interface (without using any
user-land utilities).

Output format:
--------------

Here is an example of the output format of the file "trace"::

  # tracer: function
  #
  # entries-in-buffer/entries-written: 140080/250280   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            bash-1977  [000] ....  17284.993652: sys_close <-system_call_fastpath
            bash-1977  [000] ....  17284.993653: __close_fd <-sys_close
            bash-1977  [000] ....  17284.993653: _raw_spin_lock <-__close_fd
            sshd-1974  [003] ....  17284.993653: __srcu_read_unlock <-fsnotify
            bash-1977  [000] ....  17284.993654: add_preempt_count <-_raw_spin_lock
            bash-1977  [000] ...1  17284.993655: _raw_spin_unlock <-__close_fd
            bash-1977  [000] ...1  17284.993656: sub_preempt_count <-_raw_spin_unlock
            bash-1977  [000] ....  17284.993657: filp_close <-__close_fd
            bash-1977  [000] ....  17284.993657: dnotify_flush <-filp_close
            sshd-1974  [003] ....  17284.993658: sys_select <-system_call_fastpath
            ....

A header is printed with the tracer name that is represented by
the trace. In this case the tracer is "function".
Then it shows the
number of events in the buffer as well as the total number of entries
that were written. The difference is the number of entries that were
lost due to the buffer filling up (250280 - 140080 = 110200 events
lost).

The header explains the content of the events. Task name "bash", the task
PID "1977", the CPU that it was running on "000", the latency format
(explained below), the timestamp in <secs>.<usecs> format, the
function name that was traced "sys_close" and the parent function that
called this function "system_call_fastpath". The timestamp is the time
at which the function was entered.

Latency trace format
--------------------

When the latency-format option is enabled or when one of the latency
tracers is set, the trace file gives somewhat more information to see
why a latency happened. Here is a typical trace::

  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: __lock_task_sighand
  #  => ended at:   _raw_spin_unlock_irqrestore
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
        ps-6143    2d..1  306us : <stack trace>
   => trace_hardirqs_on_caller
   => trace_hardirqs_on
   => _raw_spin_unlock_irqrestore
   => do_task_stat
   => proc_tgid_stat
   => proc_single_show
   => seq_read
   => vfs_read
   => sys_read
   => system_call_fastpath


This shows that the current tracer is "irqsoff" tracing the time
for which interrupts were disabled. It gives the trace version (which
never changes) and the version of the kernel upon which this was
executed (3.8). Then it displays the max latency in microseconds
(259 us), the number of trace entries displayed and the total number
(both are four: #4/4).
VP, KP, SP, and HP are always zero and are reserved for later use.
#P is the number of online CPUs (#P:4).

The task is the process that was running when the latency
occurred. (ps pid: 6143).

The start and stop (the functions in which the interrupts were
disabled and enabled respectively) that caused the latencies:

  - __lock_task_sighand is where the interrupts were disabled.
  - _raw_spin_unlock_irqrestore is where they were enabled again.

The next lines after the header are the trace itself. The header
explains which is which.

  cmd: The name of the process in the trace.

  pid: The PID of that process.

  CPU#: The CPU which the process was running on.

  irqs-off: 'd' interrupts are disabled. '.' otherwise.

	.. caution:: If the architecture does not support a way to
		read the irq flags variable, an 'X' will always
		be printed here.

  need-resched:
	- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
	- 'n' only TIF_NEED_RESCHED is set,
	- 'p' only PREEMPT_NEED_RESCHED is set,
	- '.' otherwise.

  hardirq/softirq:
        - 'Z' - NMI occurred inside a hardirq
        - 'z' - NMI is running
        - 'H' - hard irq occurred inside a softirq.
        - 'h' - hard irq is running
        - 's' - soft irq is running
        - '.' - normal context.

  preempt-depth: The level of preempt_disabled

The above is mostly meaningful for kernel developers.

  time:
        When the latency-format option is enabled, the trace file
        output includes a timestamp relative to the start of the
        trace. This differs from the output when latency-format
        is disabled, which includes an absolute timestamp.

  delay:
        This is just to help catch your eye a bit better. It
        needs to be fixed to be relative only to the same CPU.
        The marks are determined by the difference between this
        current trace and the next trace.

          - '$' - greater than 1 second
          - '@' - greater than 100 milliseconds
          - '*' - greater than 10 milliseconds
          - '#' - greater than 1000 microseconds
          - '!' - greater than 100 microseconds
          - '+' - greater than 10 microseconds
          - ' ' - less than or equal to 10 microseconds.

  The rest is the same as the 'trace' file.

  Note, the latency tracers will usually end with a back trace
  to easily find where the latency occurred.

trace_options
-------------

The trace_options file (or the options directory) is used to control
what gets printed in the trace output, or to manipulate the tracers.
To see what is available, simply cat the file::

  cat trace_options
        print-parent
        nosym-offset
        nosym-addr
        noverbose
        noraw
        nohex
        nobin
        noblock
        trace_printk
        annotate
        nouserstacktrace
        nosym-userobj
        noprintk-msg-only
        context-info
        nolatency-format
        record-cmd
        norecord-tgid
        overwrite
        nodisable_on_free
        irq-info
        markers
        noevent-fork
        function-trace
        nofunction-fork
        nodisplay-graph
        nostacktrace
        nobranch

To disable one of the options, echo in the option prepended with
"no"::

  echo noprint-parent > trace_options

To enable an option, leave off the "no"::

  echo sym-offset > trace_options

Here are the available options:

  print-parent
        On function traces, display the calling (parent)
        function as well as the function being traced.
        ::

          print-parent:
           bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul

          noprint-parent:
           bash-4000  [01]  1477.606694: simple_strtoul


  sym-offset
        Display not only the function name, but also the
        offset in the function. For example, instead of
        seeing just "ktime_get", you will see
        "ktime_get+0xb/0x20".
        ::

          sym-offset:
           bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0

  sym-addr
        This will also display the function address as well
        as the function name.
        ::

          sym-addr:
           bash-4000  [01]  1477.606694: simple_strtoul <c0339346>

  verbose
        This deals with the trace file when the
        latency-format option is enabled.
        ::

            bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
            (+0.000ms): simple_strtoul (kstrtoul)

  raw
        This will display raw numbers. This option is best for
        use with user applications that can translate the raw
        numbers better than having it done in the kernel.

  hex
        Similar to raw, but the numbers will be in a hexadecimal format.

  bin
        This will print out the formats in raw binary.

  block
        When set, reading trace_pipe will not block when polled.

  trace_printk
        Can disable trace_printk() from writing into the buffer.

  annotate
        It is sometimes confusing when the CPU buffers are full
        and one CPU buffer had a lot of events recently, thus
        a shorter time frame, where another CPU may have only had
        a few events, which lets it have older events. When
        the trace is reported, it shows the oldest events first,
        and it may look like only one CPU ran (the one with the
        oldest events). When the annotate option is set, it will
        display when a new CPU buffer started::

          <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
          <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
          <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
        ##### CPU 2 buffer started ####
          <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
          <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
          <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock

  userstacktrace
        This option changes the trace. It records a
        stacktrace of the current user space thread after
        each trace event.

  sym-userobj
        When user stacktraces are enabled, look up which
        object the address belongs to, and print a
        relative address. This is especially useful when
        ASLR is on, otherwise you don't get a chance to
        resolve the address to object/file/line after
        the app is no longer running.

        The lookup is performed when you read
        trace, trace_pipe. Example::

          a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
          x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]


  printk-msg-only
        When set, trace_printk()s will only show the format
        and not their parameters (if trace_bprintk() or
        trace_bputs() was used to save the trace_printk()).

  context-info
        Show only the event data. Hides the comm, PID,
        timestamp, CPU, and other useful data.

  latency-format
        This option changes the trace output. When it is enabled,
        the trace displays additional information about the
        latency, as described in "Latency trace format".

  pause-on-trace
        When set, opening the trace file for read will pause
        writing to the ring buffer (as if tracing_on was set to zero).
        This simulates the original behavior of the trace file.
        When the file is closed, tracing will be enabled again.

  record-cmd
        When any event or tracer is enabled, a hook is enabled
        in the sched_switch trace point to fill comm cache
        with mapped pids and comms. But this may cause some
        overhead, and if you only care about pids, and not the
        name of the task, disabling this option can lower the
        impact of tracing. See "saved_cmdlines".

  record-tgid
        When any event or tracer is enabled, a hook is enabled
        in the sched_switch trace point to fill the cache of
        mapped Thread Group IDs (TGID) mapping to pids. See
        "saved_tgids".

  overwrite
        This controls what happens when the trace buffer is
        full. If "1" (default), the oldest events are
        discarded and overwritten. If "0", then the newest
        events are discarded.
        (see per_cpu/cpu0/stats for overrun and dropped)

  disable_on_free
        When the free_buffer is closed, tracing will
        stop (tracing_on set to 0).

  irq-info
        Shows the interrupt, preempt count, need resched data.
        When disabled, the trace looks like::

          # tracer: function
          #
          # entries-in-buffer/entries-written: 144405/9452052   #P:4
          #
          #           TASK-PID   CPU#      TIMESTAMP  FUNCTION
          #              | |       |          |         |
                  <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
                  <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
                  <idle>-0     [002]  23636.756055: enqueue_task <-activate_task


  markers
        When set, the trace_marker is writable (only by root).
        When disabled, the trace_marker will error with EINVAL
        on write.

  event-fork
        When set, tasks with PIDs listed in set_event_pid will have
        the PIDs of their children added to set_event_pid when those
        tasks fork. Also, when tasks with PIDs in set_event_pid exit,
        their PIDs will be removed from the file.

        This affects PIDs listed in set_event_notrace_pid as well.

  function-trace
        The latency tracers will enable function tracing
        if this option is enabled (which it is by default). When
        it is disabled, the latency tracers do not trace
        functions. This keeps the overhead of the tracer down
        when performing latency tests.

  function-fork
        When set, tasks with PIDs listed in set_ftrace_pid will
        have the PIDs of their children added to set_ftrace_pid
        when those tasks fork. Also, when tasks with PIDs in
        set_ftrace_pid exit, their PIDs will be removed from the
        file.

        This affects PIDs in set_ftrace_notrace_pid as well.

  display-graph
        When set, the latency tracers (irqsoff, wakeup, etc) will
        use function graph tracing instead of function tracing.

  stacktrace
        When set, a stack trace is recorded after any trace event
        is recorded.

  branch
        Enable branch tracing with the tracer. This enables the
        branch tracer along with the currently set tracer.
        Enabling this with the "nop" tracer is the same as just
        enabling the "branch" tracer.

.. tip:: Some tracers have their own options. They only appear in this
         file when the tracer is active. They always appear in the
         options directory.


Here are the per tracer options:

Options for function tracer:

  func_stack_trace
        When set, a stack trace is recorded after every
        function that is recorded. NOTE! Limit the functions
        that are recorded before enabling this, with
        "set_ftrace_filter", otherwise the system performance
        will be critically degraded. Remember to disable
        this option before clearing the function filter.

Options for function_graph tracer:

  Since the function_graph tracer has a slightly different output
  it has its own options to control what is displayed.

  funcgraph-overrun
        When set, the "overrun" of the graph stack is
        displayed after each function traced. The
        overrun is when the stack depth of the calls
        is greater than what is reserved for each task.
        Each task has a fixed array of functions to
        trace in the call graph. If the depth of the
        calls exceeds that, the function is not traced.
        The overrun is the number of functions missed
        due to exceeding this array.

  funcgraph-cpu
        When set, the CPU number of the CPU where the trace
        occurred is displayed.

  funcgraph-overhead
        When set, if the function takes longer than
        a certain amount, then a delay marker is
        displayed. See "delay" above, under the
        header description.

  funcgraph-proc
        Unlike other tracers, the process' command line
        is not displayed by default, but instead only
        when a task is traced in and out during a context
        switch. Enabling this option displays the command
        of each process at every line.

  funcgraph-duration
        At the end of each function (the return)
        the duration of the time spent in the
        function is displayed in microseconds.

  funcgraph-abstime
        When set, the timestamp is displayed at each line.

  funcgraph-irqs
        When disabled, functions that happen inside an
        interrupt will not be traced.

  funcgraph-tail
        When set, the return event will include the function
        that it represents. By default this is off, and
        only a closing curly bracket "}" is displayed for
        the return of a function.

  sleep-time
        When running the function graph tracer, include
        the time a task schedules out in its function.
        When enabled, it will account time the task has been
        scheduled out as part of the function call.

  graph-time
        When running the function profiler with the function
        graph tracer, include the time to call nested functions.
        When this is not set, the time reported for the function
        will only include the time the function itself executed
        for, not the time for functions that it called.

Options for blk tracer:

  blk_classic
        Shows a more minimalistic output.


irqsoff
-------

When interrupts are disabled, the CPU cannot react to any other
external event (besides NMIs and SMIs). This prevents the timer
interrupt from triggering or the mouse interrupt from letting
the kernel know of a new mouse event. The result is added
latency in reaction time.

The irqsoff tracer tracks the time for which interrupts are
disabled. When a new maximum latency is hit, the tracer saves
the trace leading up to that latency point so that every time a
new maximum is reached, the old saved trace is discarded and the
new trace is saved.

To reset the maximum, echo 0 into tracing_max_latency. Here is
an example::

  # echo 0 > options/function-trace
  # echo irqsoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: run_timer_softirq
  #  => ended at:   run_timer_softirq
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
    <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
    <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
    <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
    <idle>-0       0dNs3   25us : <stack trace>
   => _raw_spin_unlock_irq
   => run_timer_softirq
   => __do_softirq
   => call_softirq
   => do_softirq
   => irq_exit
   => smp_apic_timer_interrupt
   => apic_timer_interrupt
   => rcu_idle_exit
   => cpu_idle
   => rest_init
   => start_kernel
   => x86_64_start_reservations
   => x86_64_start_kernel

Here we see that we had a latency of 16 microseconds (which is
very good). The _raw_spin_lock_irq in run_timer_softirq disabled
interrupts. The difference between the 16 and the displayed
timestamp 25us occurred because the clock was incremented
between the time of recording the max latency and the time of
recording the function that had that latency.

Note the above example had function-trace not set. If we set
function-trace, we get a much larger output::

  with echo 1 > options/function-trace

  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: ata_scsi_queuecmd
  #  => ended at:   ata_scsi_queuecmd
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
      bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
      bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
      bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
      bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
      bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
      bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
      bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
      bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
      bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
  [...]
      bash-2042    3d..1   67us : delay_tsc <-__delay
      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
      bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
      bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
      bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
      bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
      bash-2042    3d..1  120us : <stack trace>
   => _raw_spin_unlock_irqrestore
   => ata_scsi_queuecmd
   => scsi_dispatch_cmd
   => scsi_request_fn
   => __blk_run_queue_uncond
   => __blk_run_queue
   => blk_queue_bio
   => submit_bio_noacct
   => submit_bio
   => submit_bh
   => __ext3_get_inode_loc
   => ext3_iget
   => ext3_lookup
   => lookup_real
   => __lookup_hash
   => walk_component
   => lookup_last
   => path_lookupat
   => filename_lookup
   => user_path_at_empty
   => user_path_at
   => vfs_fstatat
   => vfs_stat
   => sys_newstat
   => system_call_fastpath


Here we traced a 71 microsecond latency. But we also see all the
functions that were called during that time. Note that by
enabling function tracing, we incur an added overhead. This
overhead may extend the latency times. But nevertheless, this
trace has provided some very helpful debugging information.
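When scripting around these reports, the maximum latency can be pulled out of the header rather than read by eye. A minimal sketch in plain shell, using the header line from the trace above as sample input (the sed pattern is an illustration written for this header layout, not part of the ftrace interface)::

  # Extract the reported max latency (in microseconds) from a latency
  # trace header line. The sample line is copied from the trace above;
  # on a live system the header would come from reading the trace file.
  header='# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)'
  latency_us=$(printf '%s\n' "$header" | sed -n 's/^# latency: \([0-9]*\) us.*/\1/p')
  echo "$latency_us"

Note that tracing_max_latency holds the same number and is simpler to read directly when only the value is needed.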

If we prefer function graph output instead of function, we can set
the display-graph option::

  with echo 1 > options/display-graph

  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
  # --------------------------------------------------------------------
  # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: free_debug_processing
  #  => ended at:   return_to_handler
  #
  #
  #                                       _-----=> irqs-off
  #                                      / _----=> need-resched
  #                                     | / _---=> hardirq/softirq
  #                                     || / _--=> preempt-depth
  #                                     ||| /
  #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
  #      |          |     |    |        ||||      |   |                     |   |   |   |
          0 us |   0)    bash-1507    |  d... |   0.000 us    |  _raw_spin_lock_irqsave();
          0 us |   0)    bash-1507    |  d..1 |   0.378 us    |    do_raw_spin_trylock();
          1 us |   0)    bash-1507    |  d..2 |               |    set_track() {
          2 us |   0)    bash-1507    |  d..2 |               |      save_stack_trace() {
          2 us |   0)    bash-1507    |  d..2 |               |        __save_stack_trace() {
          3 us |   0)    bash-1507    |  d..2 |               |          __unwind_start() {
          3 us |   0)    bash-1507    |  d..2 |               |            get_stack_info() {
          3 us |   0)    bash-1507    |  d..2 |   0.351 us    |              in_task_stack();
          4 us |   0)    bash-1507    |  d..2 |   1.107 us    |            }
  [...]
       3750 us |   0)    bash-1507    |  d..1 |   0.516 us    |      do_raw_spin_unlock();
       3750 us |   0)    bash-1507    |  d..1 |   0.000 us    |      _raw_spin_unlock_irqrestore();
       3764 us |   0)    bash-1507    |  d..1 |   0.000 us    |  tracer_hardirqs_on();
      bash-1507    0d..1 3792us : <stack trace>
   => free_debug_processing
   => __slab_free
   => kmem_cache_free
   => vm_area_free
   => remove_vma
   => exit_mmap
   => mmput
   => begin_new_exec
   => load_elf_binary
   => search_binary_handler
   => __do_execve_file.isra.32
   => __x64_sys_execve
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

preemptoff
----------

When preemption is disabled, we may be able to receive
interrupts but the task cannot be preempted and a higher
priority task must wait for preemption to be enabled again
before it can preempt a lower priority task.

The preemptoff tracer traces the places that disable preemption.
Like the irqsoff tracer, it records the maximum latency for
which preemption was disabled. The control of the preemptoff
tracer is much like that of the irqsoff tracer.
::

  # echo 0 > options/function-trace
  # echo preemptoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
  # tracer: preemptoff
  #
  # preemptoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: do_IRQ
  #  => ended at:   do_IRQ
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
      sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
      sshd-1991    1d..1   46us : irq_exit <-do_IRQ
      sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
      sshd-1991    1d..1   52us : <stack trace>
   => sub_preempt_count
   => irq_exit
   => do_IRQ
   => ret_from_intr


This has some more changes. Preemption was disabled when an
interrupt came in (notice the 'h'), and was enabled on exit.
But we also see that interrupts have been disabled when entering
the preempt off section and leaving it (the 'd'). We do not know if
interrupts were enabled in the meantime or shortly after this
was over.
1596:: 1597 1598 # tracer: preemptoff 1599 # 1600 # preemptoff latency trace v1.1.5 on 3.8.0-test+ 1601 # -------------------------------------------------------------------- 1602 # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1603 # ----------------- 1604 # | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0) 1605 # ----------------- 1606 # => started at: wake_up_new_task 1607 # => ended at: task_rq_unlock 1608 # 1609 # 1610 # _------=> CPU# 1611 # / _-----=> irqs-off 1612 # | / _----=> need-resched 1613 # || / _---=> hardirq/softirq 1614 # ||| / _--=> preempt-depth 1615 # |||| / delay 1616 # cmd pid ||||| time | caller 1617 # \ / ||||| \ | / 1618 bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task 1619 bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq 1620 bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair 1621 bash-1994 1d..1 1us : source_load <-select_task_rq_fair 1622 bash-1994 1d..1 1us : source_load <-select_task_rq_fair 1623 [...] 1624 bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt 1625 bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter 1626 bash-1994 1d..1 13us : add_preempt_count <-irq_enter 1627 bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt 1628 bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt 1629 bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt 1630 bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock 1631 bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt 1632 [...] 
      bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
      bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
      bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
      bash-1994    1d..2   36us : do_softirq <-irq_exit
      bash-1994    1d..2   36us : __do_softirq <-call_softirq
      bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
      bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
      bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
      bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
      bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
  [...]
      bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
      bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
      bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
      bash-1994    1dN.2   82us : idle_cpu <-irq_exit
      bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
      bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
      bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
      bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
      bash-1994    1.N.1  104us : <stack trace>
   => sub_preempt_count
   => _raw_spin_unlock_irqrestore
   => task_rq_unlock
   => wake_up_new_task
   => do_fork
   => sys_clone
   => stub_clone


The above is an example of the preemptoff trace with
function-trace set. Here we see that interrupts were not disabled
the entire time. The irq_enter code lets us know that we entered
an interrupt 'h'. Before that, the functions being traced still
show that it is not in an interrupt, but we can see from the
functions themselves that this is not the case.

preemptirqsoff
--------------

Knowing the locations that have interrupts disabled or
preemption disabled for the longest times is helpful. But
sometimes we would like to know when preemption and/or
interrupts are disabled.
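As a toy illustration of the coverage difference between the three tracers, assign invented timestamps (all values are hypothetical, made up for this sketch) to the four transitions in a section like the one shown next::

  # Invented timestamps, in microseconds, for: local_irq_disable (t0),
  # preempt_disable (t1), local_irq_enable (t2), preempt_enable (t3).
  t0=0; t1=10; t2=25; t3=40
  irqsoff=$((t2 - t0))          # window the irqsoff tracer measures
  preemptoff=$((t3 - t1))       # window the preemptoff tracer measures
  preemptirqsoff=$((t3 - t0))   # total time nothing could be scheduled
  echo "irqsoff=${irqsoff}us preemptoff=${preemptoff}us preemptirqsoff=${preemptirqsoff}us"

With these numbers the irqsoff window is 25us and the preemptoff window 30us, yet the machine could not schedule for the full 40us, which only the combined tracer reports.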

Consider the following code::

    local_irq_disable();
    call_function_with_irqs_off();
    preempt_disable();
    call_function_with_irqs_and_preemption_off();
    local_irq_enable();
    call_function_with_preemption_off();
    preempt_enable();

The irqsoff tracer will record the total length of
call_function_with_irqs_off() and
call_function_with_irqs_and_preemption_off().

The preemptoff tracer will record the total length of
call_function_with_irqs_and_preemption_off() and
call_function_with_preemption_off().

But neither will trace the time that interrupts and/or
preemption is disabled. This total time is the time that we can
not schedule. To record this time, use the preemptirqsoff
tracer.

Again, using this trace is much like the irqsoff and preemptoff
tracers.
::

  # echo 0 > options/function-trace
  # echo preemptirqsoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: ata_scsi_queuecmd
  #  => ended at:   ata_scsi_queuecmd
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
        ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
        ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
        ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
        ls-2230    3...1  111us : <stack trace>
   => sub_preempt_count
   => _raw_spin_unlock_irqrestore
   => ata_scsi_queuecmd
   => scsi_dispatch_cmd
   => scsi_request_fn
   => __blk_run_queue_uncond
   => __blk_run_queue
   => blk_queue_bio
   => submit_bio_noacct
   => submit_bio
   => submit_bh
   => ext3_bread
   => ext3_dir_bread
   => htree_dirblock_to_tree
   => ext3_htree_fill_tree
   => ext3_readdir
   => vfs_readdir
   => sys_getdents
   => system_call_fastpath


The trace_hardirqs_off_thunk is called from assembly on x86 when
interrupts are disabled in the assembly code. Without the
function tracing, we do not know if interrupts were enabled
within the preemption points. We do see that it started with
preemption enabled.

Here is a trace with function-trace set::

  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: schedule
  #  => ended at:   mutex_unlock
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
    kworker/-59    3...1    0us : __schedule <-schedule
    kworker/-59    3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
    kworker/-59    3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
    kworker/-59    3d..2    1us : deactivate_task <-__schedule
    kworker/-59    3d..2    1us : dequeue_task <-deactivate_task
    kworker/-59    3d..2    2us : update_rq_clock <-dequeue_task
    kworker/-59    3d..2    2us : dequeue_task_fair <-dequeue_task
    kworker/-59    3d..2    2us : update_curr <-dequeue_task_fair
    kworker/-59    3d..2    2us : update_min_vruntime <-update_curr
    kworker/-59    3d..2    3us : cpuacct_charge <-update_curr
    kworker/-59    3d..2    3us : __rcu_read_lock <-cpuacct_charge
    kworker/-59    3d..2    3us : __rcu_read_unlock <-cpuacct_charge
    kworker/-59    3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
    kworker/-59    3d..2    4us : clear_buddies <-dequeue_task_fair
    kworker/-59    3d..2    4us : account_entity_dequeue <-dequeue_task_fair
    kworker/-59    3d..2    4us : update_min_vruntime <-dequeue_task_fair
    kworker/-59    3d..2    4us : update_cfs_shares <-dequeue_task_fair
    kworker/-59    3d..2    5us : hrtick_update <-dequeue_task_fair
    kworker/-59    3d..2    5us : wq_worker_sleeping <-__schedule
    kworker/-59    3d..2    5us : kthread_data <-wq_worker_sleeping
    kworker/-59    3d..2    5us : put_prev_task_fair <-__schedule
    kworker/-59    3d..2    6us : pick_next_task_fair <-pick_next_task
    kworker/-59    3d..2    6us : clear_buddies <-pick_next_task_fair
    kworker/-59    3d..2    6us : set_next_entity <-pick_next_task_fair
    kworker/-59    3d..2    6us : update_stats_wait_end <-set_next_entity
        ls-2269    3d..2    7us : finish_task_switch <-__schedule
        ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
        ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
        ls-2269    3d..2    8us : irq_enter <-do_IRQ
        ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
        ls-2269    3d..2    9us : add_preempt_count <-irq_enter
        ls-2269    3d.h2    9us : exit_idle <-do_IRQ
  [...]
1818 ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock 1819 ls-2269 3d.h2 20us : irq_exit <-do_IRQ 1820 ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit 1821 ls-2269 3d..3 21us : do_softirq <-irq_exit 1822 ls-2269 3d..3 21us : __do_softirq <-call_softirq 1823 ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq 1824 ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip 1825 ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip 1826 ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr 1827 ls-2269 3d.s5 31us : irq_enter <-do_IRQ 1828 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter 1829 [...] 1830 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter 1831 ls-2269 3d.s5 32us : add_preempt_count <-irq_enter 1832 ls-2269 3d.H5 32us : exit_idle <-do_IRQ 1833 ls-2269 3d.H5 32us : handle_irq <-do_IRQ 1834 ls-2269 3d.H5 32us : irq_to_desc <-handle_irq 1835 ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq 1836 [...] 1837 ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll 1838 ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action 1839 ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq 1840 ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable 1841 ls-2269 3d..3 159us : idle_cpu <-irq_exit 1842 ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit 1843 ls-2269 3d..3 160us : sub_preempt_count <-irq_exit 1844 ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock 1845 ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock 1846 ls-2269 3d... 186us : <stack trace> 1847 => __mutex_unlock_slowpath 1848 => mutex_unlock 1849 => process_output 1850 => n_tty_write 1851 => tty_write 1852 => vfs_write 1853 => sys_write 1854 => system_call_fastpath 1855 1856This is an interesting trace. It started with kworker running and 1857scheduling out and ls taking over. But as soon as ls released the 1858rq lock and enabled interrupts (but not preemption) an interrupt 1859triggered. When the interrupt finished, it started running softirqs. 
1860But while the softirq was running, another interrupt triggered. 1861When an interrupt is running inside a softirq, the annotation is 'H'. 1862 1863 1864wakeup 1865------ 1866 1867One common case that people are interested in tracing is the 1868time it takes for a task that is woken to actually wake up. 1869Now for non Real-Time tasks, this can be arbitrary. But tracing 1870it none the less can be interesting. 1871 1872Without function tracing:: 1873 1874 # echo 0 > options/function-trace 1875 # echo wakeup > current_tracer 1876 # echo 1 > tracing_on 1877 # echo 0 > tracing_max_latency 1878 # chrt -f 5 sleep 1 1879 # echo 0 > tracing_on 1880 # cat trace 1881 # tracer: wakeup 1882 # 1883 # wakeup latency trace v1.1.5 on 3.8.0-test+ 1884 # -------------------------------------------------------------------- 1885 # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 1886 # ----------------- 1887 # | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0) 1888 # ----------------- 1889 # 1890 # _------=> CPU# 1891 # / _-----=> irqs-off 1892 # | / _----=> need-resched 1893 # || / _---=> hardirq/softirq 1894 # ||| / _--=> preempt-depth 1895 # |||| / delay 1896 # cmd pid ||||| time | caller 1897 # \ / ||||| \ | / 1898 <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H 1899 <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up 1900 <idle>-0 3d..3 15us : __schedule <-schedule 1901 <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H 1902 1903The tracer only traces the highest priority task in the system 1904to avoid tracing the normal circumstances. Here we see that 1905the kworker with a nice priority of -20 (not very nice), took 1906just 15 microseconds from the time it woke up, to the time it 1907ran. 1908 1909Non Real-Time tasks are not that interesting. A more interesting 1910trace is to concentrate only on Real-Time tasks. 

wakeup_rt
---------

In a Real-Time environment it is very important to know the time
from when the highest priority task is woken up to when it
actually executes. This is also known as "schedule latency". I
stress the point that this is about RT tasks. It is also
important to know the scheduling latency of non-RT tasks, but
for them the average schedule latency is the more useful
measure. Tools like LatencyTop are more appropriate for such
measurements.

Real-Time environments are interested in the worst case latency.
That is the longest latency it takes for something to happen,
and not the average. We can have a very fast scheduler that may
only have a large latency once in a while, but that would not
work well with Real-Time tasks. The wakeup_rt tracer was designed
to record the worst case wakeups of RT tasks. Non-RT tasks are
not recorded because the tracer only records one worst case and
tracing non-RT tasks that are unpredictable will overwrite the
worst case latency of RT tasks (just run the normal wakeup
tracer for a while to see that effect).

Since this tracer only deals with RT tasks, we will run this
slightly differently than we did with the previous tracers.
Instead of performing an 'ls', we will run 'sleep 1' under
'chrt' which changes the priority of the task.
::

  # echo 0 > options/function-trace
  # echo wakeup_rt > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # chrt -f 5 sleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  # -----------------
  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
  # -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
    <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
    <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
    <idle>-0       3d..3    5us : __schedule <-schedule
    <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep


Running this on an idle system, we see that it only took 5 microseconds
to perform the task switch. Note, since the trace point in the schedule
is before the actual "switch", we stop the tracing when the recorded task
is about to schedule in. This may change if we add a new marker at the
end of the scheduler.

Notice that the recorded task is 'sleep' with the PID of 2389
and it has an rt_prio of 5. This priority is user-space priority
and not the internal kernel priority. The policy is 1 for
SCHED_FIFO and 2 for SCHED_RR.

Note that the trace data shows the internal priority (99 - rtprio).
::

    <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep

The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
and in the running state 'R'. The sleep task was scheduled in with
2389: 94:R.
That is the priority is the kernel rtprio (99 - 5 = 94) 1992and it too is in the running state. 1993 1994Doing the same with chrt -r 5 and function-trace set. 1995:: 1996 1997 echo 1 > options/function-trace 1998 1999 # tracer: wakeup_rt 2000 # 2001 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ 2002 # -------------------------------------------------------------------- 2003 # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 2004 # ----------------- 2005 # | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5) 2006 # ----------------- 2007 # 2008 # _------=> CPU# 2009 # / _-----=> irqs-off 2010 # | / _----=> need-resched 2011 # || / _---=> hardirq/softirq 2012 # ||| / _--=> preempt-depth 2013 # |||| / delay 2014 # cmd pid ||||| time | caller 2015 # \ / ||||| \ | / 2016 <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep 2017 <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up 2018 <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup 2019 <idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr 2020 <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup 2021 <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up 2022 <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock 2023 <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up 2024 <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up 2025 <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore 2026 <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer 2027 <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock 2028 <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt 2029 <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock 2030 <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt 2031 <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event 2032 <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event 2033 <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event 2034 <idle>-0 3dNh1 8us : irq_exit 
<-smp_apic_timer_interrupt 2035 <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit 2036 <idle>-0 3dN.2 9us : idle_cpu <-irq_exit 2037 <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit 2038 <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit 2039 <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit 2040 <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle 2041 <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit 2042 <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle 2043 <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit 2044 <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit 2045 <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit 2046 <idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit 2047 <idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz 2048 <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock 2049 <idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz 2050 <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update 2051 <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz 2052 <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock 2053 <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit 2054 <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit 2055 <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit 2056 <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel 2057 <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel 2058 <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 2059 <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave 2060 <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16 2061 <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer 2062 <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram 2063 <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event 2064 <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event 
2065 <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event 2066 <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel 2067 <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore 2068 <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit 2069 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward 2070 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward 2071 <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 2072 <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns 2073 <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns 2074 <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 2075 <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave 2076 <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns 2077 <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns 2078 <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns 2079 <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event 2080 <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event 2081 <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event 2082 <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns 2083 <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore 2084 <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit 2085 <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks 2086 <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle 2087 <idle>-0 3.N.. 25us : schedule <-cpu_idle 2088 <idle>-0 3.N.. 25us : __schedule <-preempt_schedule 2089 <idle>-0 3.N.. 
26us : add_preempt_count <-__schedule
    <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
    <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
    <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
    <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
    <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
    <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
    <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
    <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
    <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
    <idle>-0       3d..3   29us : __schedule <-preempt_schedule
    <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep

This isn't that big of a trace, even with function tracing enabled,
so I included the entire trace.

The interrupt went off while the system was idle. Somewhere
before task_woken_rt() was called, the NEED_RESCHED flag was set;
this is indicated by the first occurrence of the 'N' flag.

Latency tracing and events
--------------------------
Function tracing can induce a much larger latency, but without
seeing what happens within the latency it is hard to know what
caused it. There is a middle ground, and that is to enable
events.
2115:: 2116 2117 # echo 0 > options/function-trace 2118 # echo wakeup_rt > current_tracer 2119 # echo 1 > events/enable 2120 # echo 1 > tracing_on 2121 # echo 0 > tracing_max_latency 2122 # chrt -f 5 sleep 1 2123 # echo 0 > tracing_on 2124 # cat trace 2125 # tracer: wakeup_rt 2126 # 2127 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ 2128 # -------------------------------------------------------------------- 2129 # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) 2130 # ----------------- 2131 # | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5) 2132 # ----------------- 2133 # 2134 # _------=> CPU# 2135 # / _-----=> irqs-off 2136 # | / _----=> need-resched 2137 # || / _---=> hardirq/softirq 2138 # ||| / _--=> preempt-depth 2139 # |||| / delay 2140 # cmd pid ||||| time | caller 2141 # \ / ||||| \ | / 2142 <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep 2143 <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up 2144 <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002 2145 <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8 2146 <idle>-0 2.N.2 2us : power_end: cpu_id=2 2147 <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2 2148 <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0 2149 <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000 2150 <idle>-0 2.N.2 5us : rcu_utilization: Start context switch 2151 <idle>-0 2.N.2 5us : rcu_utilization: End context switch 2152 <idle>-0 2d..3 6us : __schedule <-schedule 2153 <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep 2154 2155 2156Hardware Latency Detector 2157------------------------- 2158 2159The hardware latency detector is executed by enabling the "hwlat" tracer. 2160 2161NOTE, this tracer will affect the performance of the system as it will 2162periodically make a CPU constantly busy with interrupts disabled. 
::

  # echo hwlat > current_tracer
  # sleep 100
  # cat trace
  # tracer: hwlat
  #
  # entries-in-buffer/entries-written: 13/13   #P:8
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
  #              | |       |   ||||      |         |
             <...>-1729  [001] d...   678.473449: #1     inner/outer(us):   11/12    ts:1581527483.343962693 count:6
             <...>-1729  [004] d...   689.556542: #2     inner/outer(us):   16/9     ts:1581527494.889008092 count:1
             <...>-1729  [005] d...   714.756290: #3     inner/outer(us):   16/16    ts:1581527519.678961629 count:5
             <...>-1729  [001] d...   718.788247: #4     inner/outer(us):    9/17    ts:1581527523.889012713 count:1
             <...>-1729  [002] d...   719.796341: #5     inner/outer(us):   13/9     ts:1581527524.912872606 count:1
             <...>-1729  [006] d...   844.787091: #6     inner/outer(us):    9/12    ts:1581527649.889048502 count:2
             <...>-1729  [003] d...   849.827033: #7     inner/outer(us):   18/9     ts:1581527654.889013793 count:1
             <...>-1729  [007] d...   853.859002: #8     inner/outer(us):    9/12    ts:1581527658.889065736 count:1
             <...>-1729  [001] d...   855.874978: #9     inner/outer(us):    9/11    ts:1581527660.861991877 count:1
             <...>-1729  [001] d...   863.938932: #10    inner/outer(us):    9/11    ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
             <...>-1729  [007] d...   878.050780: #11    inner/outer(us):    9/12    ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
             <...>-1729  [007] d...   886.114702: #12    inner/outer(us):    9/12    ts:1581527691.385001600 count:1


The header of the above output is similar to that of the other tracers.
All events will have interrupts disabled, denoted by 'd'. Under the
FUNCTION title there is:

 #1

	This is the count of events recorded that were greater than the
	tracing_threshold (See below).

 inner/outer(us): 11/12

	This shows two numbers as "inner latency" and "outer latency".
The test
	runs in a loop checking a timestamp twice. The latency detected within
	the two timestamps is the "inner latency" and the latency detected
	between the previous timestamp and the next timestamp in the loop is
	the "outer latency".

 ts:1581527483.343962693

	The absolute timestamp that the first latency was recorded in the window.

 count:6

	The number of times a latency was detected during the window.

 nmi-total:7 nmi-count:1

	On architectures that support it, if an NMI comes in during the
	test, the time spent in NMI is reported in "nmi-total" (in
	microseconds).

	All architectures that have NMIs will show the "nmi-count" if an
	NMI comes in during the test.

hwlat files:

  tracing_threshold
	This gets automatically set to "10" to represent 10
	microseconds. This is the threshold of latency that
	needs to be detected before the trace will be recorded.

	Note, when the hwlat tracer is finished (another tracer is
	written into "current_tracer"), the original value for
	tracing_threshold is placed back into this file.

  hwlat_detector/width
	The length of time the test runs with interrupts disabled.

  hwlat_detector/window
	The length of time of the window which the test
	runs. That is, the test will run for "width"
	microseconds per "window" microseconds.

  tracing_cpumask
	When the test is started, a kernel thread is created that
	runs the test. This thread will alternate between CPUs
	listed in the tracing_cpumask between each period
	(one "window"). To limit the test to specific CPUs,
	set the mask in this file to only the CPUs that the test
	should run on.

function
--------

This tracer is the function tracer. Enabling the function tracer
can be done from the debug file system.
Make sure the 2257ftrace_enabled is set; otherwise this tracer is a nop. 2258See the "ftrace_enabled" section below. 2259:: 2260 2261 # sysctl kernel.ftrace_enabled=1 2262 # echo function > current_tracer 2263 # echo 1 > tracing_on 2264 # usleep 1 2265 # echo 0 > tracing_on 2266 # cat trace 2267 # tracer: function 2268 # 2269 # entries-in-buffer/entries-written: 24799/24799 #P:4 2270 # 2271 # _-----=> irqs-off 2272 # / _----=> need-resched 2273 # | / _---=> hardirq/softirq 2274 # || / _--=> preempt-depth 2275 # ||| / delay 2276 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 2277 # | | | |||| | | 2278 bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write 2279 bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock 2280 bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify 2281 bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify 2282 bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify 2283 bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock 2284 bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock 2285 bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify 2286 [...] 2287 2288 2289Note: function tracer uses ring buffers to store the above 2290entries. The newest data may overwrite the oldest data. 2291Sometimes using echo to stop the trace is not sufficient because 2292the tracing could have overwritten the data that you wanted to 2293record. For this reason, it is sometimes better to disable 2294tracing directly from a program. This allows you to stop the 2295tracing at the point that you hit the part that you are 2296interested in. To disable the tracing directly from a C program, 2297something like following code snippet can be used:: 2298 2299 int trace_fd; 2300 [...] 2301 int main(int argc, char *argv[]) { 2302 [...] 2303 trace_fd = open(tracing_file("tracing_on"), O_WRONLY); 2304 [...] 2305 if (condition_hit()) { 2306 write(trace_fd, "0", 1); 2307 } 2308 [...] 
2309 } 2310 2311 2312Single thread tracing 2313--------------------- 2314 2315By writing into set_ftrace_pid you can trace a 2316single thread. For example:: 2317 2318 # cat set_ftrace_pid 2319 no pid 2320 # echo 3111 > set_ftrace_pid 2321 # cat set_ftrace_pid 2322 3111 2323 # echo function > current_tracer 2324 # cat trace | head 2325 # tracer: function 2326 # 2327 # TASK-PID CPU# TIMESTAMP FUNCTION 2328 # | | | | | 2329 yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return 2330 yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range 2331 yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel 2332 yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel 2333 yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll 2334 yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll 2335 # echo > set_ftrace_pid 2336 # cat trace |head 2337 # tracer: function 2338 # 2339 # TASK-PID CPU# TIMESTAMP FUNCTION 2340 # | | | | | 2341 ##### CPU 3 buffer started #### 2342 yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait 2343 yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry 2344 yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry 2345 yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit 2346 yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit 2347 2348If you want to trace a function when executing, you could use 2349something like this simple program. 
::

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <string.h>

	#define _STR(x) #x
	#define STR(x) _STR(x)
	#define MAX_PATH 256

	const char *find_tracefs(void)
	{
		static char tracefs[MAX_PATH+1];
		static int tracefs_found;
		char type[100];
		FILE *fp;

		if (tracefs_found)
			return tracefs;

		if ((fp = fopen("/proc/mounts","r")) == NULL) {
			perror("/proc/mounts");
			return NULL;
		}

		while (fscanf(fp, "%*s %"
			      STR(MAX_PATH)
			      "s %99s %*s %*d %*d\n",
			      tracefs, type) == 2) {
			if (strcmp(type, "tracefs") == 0)
				break;
		}
		fclose(fp);

		if (strcmp(type, "tracefs") != 0) {
			fprintf(stderr, "tracefs not mounted");
			return NULL;
		}

		/* tracefs now holds the tracefs mount point itself */
		tracefs_found = 1;

		return tracefs;
	}

	const char *tracing_file(const char *file_name)
	{
		static char trace_file[MAX_PATH+1];
		snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
		return trace_file;
	}

	int main (int argc, char **argv)
	{
		if (argc < 2)	/* need a command to execute */
			exit(-1);

		if (fork() > 0) {
			int fd, ffd;
			char line[64];
			int s;

			ffd = open(tracing_file("current_tracer"), O_WRONLY);
			if (ffd < 0)
				exit(-1);
			write(ffd, "nop", 3);

			fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
			s = sprintf(line, "%d\n", getpid());
			write(fd, line, s);

			write(ffd, "function", 8);

			close(fd);
			close(ffd);

			execvp(argv[1], argv+1);
		}

		return 0;
	}

Or this simple script!
::

	#!/bin/bash

	tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
	echo nop > $tracefs/current_tracer
	echo 0 > $tracefs/tracing_on
	echo $$ > $tracefs/set_ftrace_pid
	echo function > $tracefs/current_tracer
	echo 1 > $tracefs/tracing_on
	exec "$@"


function graph tracer
---------------------

This tracer is similar to the function tracer except that it
probes a function on its entry and its exit. This is done by
using a dynamically allocated stack of return addresses in each
task_struct. On function entry the tracer overwrites the return
address of each function traced to set a custom probe. Thus the
original return address is stored on the stack of return
addresses in the task_struct.

Probing on both ends of a function leads to special features
such as:

- measuring a function's execution time

- having a reliable call stack from which to draw a graph of
  function calls

This tracer is useful in several situations:

- you want to find the reason for strange kernel behavior and
  need to see in detail what happens in any area (or in specific
  ones).

- you are experiencing weird latencies but it's difficult to
  find their origin.

- you want to quickly find which path is taken by a specific
  function.

- you just want to peek inside a working kernel and see
  what happens there.

::

  # tracer: function_graph
  #
  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |

   0)               |  sys_open() {
   0)               |    do_sys_open() {
   0)               |      getname() {
   0)               |        kmem_cache_alloc() {
   0)   1.382 us    |          __might_sleep();
   0)   2.478 us    |        }
   0)               |        strncpy_from_user() {
   0)               |          might_fault() {
   0)   1.389 us    |            __might_sleep();
   0)   2.553 us    |          }
   0)   3.807 us    |        }
   0)   7.876 us    |      }
   0)               |    alloc_fd() {
   0)   0.668 us    |      _spin_lock();
   0)   0.570 us    |      expand_files();
   0)   0.586 us    |      _spin_unlock();


There are several columns that can be dynamically
enabled/disabled. You can use every combination of options you
want, depending on your needs.

- The CPU number on which the function executed is displayed by
  default. It is sometimes better to only trace one CPU (see the
  tracing_cpu_mask file), as you might otherwise see function
  calls appear out of order when tracing switches between CPUs.

	- hide: echo nofuncgraph-cpu > trace_options
	- show: echo funcgraph-cpu > trace_options

- The duration (the function's execution time) is displayed on
  the closing bracket line of a function, or on the same line as
  the current function in the case of a leaf function. It is
  enabled by default.

	- hide: echo nofuncgraph-duration > trace_options
	- show: echo funcgraph-duration > trace_options

- The overhead field precedes the duration field when the
  duration exceeds certain thresholds.

	- hide: echo nofuncgraph-overhead > trace_options
	- show: echo funcgraph-overhead > trace_options
	- depends on: funcgraph-duration

  ie::

    3) # 1837.709 us |          } /* __switch_to */
    3)               |          finish_task_switch() {
    3)   0.313 us    |            _raw_spin_unlock_irq();
    3)   3.177 us    |          }
    3) # 1889.063 us |        } /* __schedule */
    3) ! 140.417 us  |        } /* __schedule */
    3) # 2034.948 us |      } /* schedule */
    3) * 33998.59 us |      } /* schedule_preempt_disabled */

    [...]

    1)   0.260 us    |        msecs_to_jiffies();
    1)   0.313 us    |        __rcu_read_unlock();
    1) + 61.770 us   |      }
    1) + 64.479 us   |    }
    1)   0.313 us    |    rcu_bh_qs();
    1)   0.313 us    |    __local_bh_enable();
    1) ! 217.240 us  |  }
    1)   0.365 us    |  idle_cpu();
    1)               |  rcu_irq_exit() {
    1)   0.417 us    |    rcu_eqs_enter_common.isra.47();
    1)   3.125 us    |  }
    1) ! 227.812 us  |  }
    1) ! 457.395 us  |  }
    1) @ 119760.2 us |  }

    [...]

    2)               |  handle_IPI() {
    1)   6.979 us    |  }
    2)   0.417 us    |    scheduler_ipi();
    1)   9.791 us    |  }
    1) + 12.917 us   |  }
    2)   3.490 us    |  }
    1) + 15.729 us   |  }
    1) + 18.542 us   |  }
    2) $ 3594274 us  |  }

Flags::

  + means that the function exceeded 10 usecs.
  ! means that the function exceeded 100 usecs.
  # means that the function exceeded 1000 usecs.
  * means that the function exceeded 10 msecs.
  @ means that the function exceeded 100 msecs.
  $ means that the function exceeded 1 sec.


- The task/pid field displays the thread cmdline and pid which
  executed the function. It is disabled by default.

	- hide: echo nofuncgraph-proc > trace_options
	- show: echo funcgraph-proc > trace_options

  ie::

    # tracer: function_graph
    #
    # CPU  TASK/PID        DURATION                  FUNCTION CALLS
    # |    |    |           |   |                     |   |   |   |
    0)    sh-4802     |               |                  d_free() {
    0)    sh-4802     |               |                    call_rcu() {
    0)    sh-4802     |               |                      __call_rcu() {
    0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
    0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
    0)    sh-4802     |   2.899 us    |                      }
    0)    sh-4802     |   4.040 us    |                    }
    0)    sh-4802     |   5.151 us    |                  }
    0)    sh-4802     | + 49.370 us   |                }


- The absolute time field is an absolute timestamp given by the
  system clock since it started.
A snapshot of this time is
  given on each entry/exit of functions.

	- hide: echo nofuncgraph-abstime > trace_options
	- show: echo funcgraph-abstime > trace_options

  ie::

    #
    #      TIME       CPU  DURATION                  FUNCTION CALLS
    #       |         |     |   |                     |   |   |   |
    360.774522 |   1)   0.541 us    |                                          }
    360.774522 |   1)   4.663 us    |                                        }
    360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
    360.774524 |   1)   6.796 us    |                                      }
    360.774524 |   1)   7.952 us    |                                    }
    360.774525 |   1)   9.063 us    |                                  }
    360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
    360.774527 |   1)   0.578 us    |                                  __brelse();
    360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
    360.774528 |   1)               |                                    unlock_buffer() {
    360.774529 |   1)               |                                      wake_up_bit() {
    360.774529 |   1)               |                                        bit_waitqueue() {
    360.774530 |   1)   0.594 us    |                                          __phys_addr();


The function name is always displayed after the closing bracket
for a function if the start of that function is not in the
trace buffer.

Display of the function name after the closing bracket may be
enabled for functions whose start is in the trace buffer,
allowing easier searching with grep for function durations.
It is disabled by default.

  - hide: echo nofuncgraph-tail > trace_options
  - show: echo funcgraph-tail > trace_options

  Example with nofuncgraph-tail (default)::

    0)               |      putname() {
    0)               |        kmem_cache_free() {
    0)   0.518 us    |          __phys_addr();
    0)   1.757 us    |        }
    0)   2.861 us    |      }

  Example with funcgraph-tail::

    0)               |      putname() {
    0)               |        kmem_cache_free() {
    0)   0.518 us    |          __phys_addr();
    0)   1.757 us    |        } /* kmem_cache_free() */
    0)   2.861 us    |      } /* putname() */

You can put some comments on specific functions by using
trace_printk(). For example, if you want to put a comment inside
the __might_sleep() function, you just have to include
<linux/ftrace.h> and call trace_printk() inside __might_sleep()::

	trace_printk("I'm a comment!\n")

will produce::

 1)               |             __might_sleep() {
 1)               |                /* I'm a comment! */
 1)   1.449 us    |             }


You might find other useful features for this tracer in the
following "dynamic ftrace" section such as tracing only specific
functions or tasks.

dynamic ftrace
--------------

If CONFIG_DYNAMIC_FTRACE is set, the system will run with
virtually no overhead when function tracing is disabled. The way
this works is that the mcount function call (placed at the start of
every kernel function, produced by the -pg switch in gcc)
starts off pointing to a simple return. (Enabling FTRACE will
include the -pg switch in the compiling of the kernel.)

At compile time every C file object is run through the
recordmcount program (located in the scripts directory). This
program will parse the ELF headers in the C object to find all
the locations in the .text section that call mcount. Starting
with gcc version 4.6, the -mfentry option has been added for x86,
which calls "__fentry__" instead of "mcount" and is called before
the creation of the stack frame.

Note, not all functions are traced. A function may be prevented
from being traced by a notrace annotation or blocked in another
way, and inline functions are never traced. Check the
"available_filter_functions" file to see what functions can be
traced.

A section called "__mcount_loc" is created that holds
references to all the mcount/fentry call sites in the .text section.
The recordmcount program re-links this section back into the
original object. The final linking stage of the kernel will add all these
references into a single table.

On boot up, before SMP is initialized, the dynamic ftrace code
scans this table and updates all the locations into nops. It
also records the locations, which are added to the
available_filter_functions list. Modules are processed as they
are loaded and before they are executed. When a module is
unloaded, it also removes its functions from the ftrace function
list. This is automatic in the module unload code, and the
module author does not need to worry about it.

When tracing is enabled, the process of modifying the function
tracepoints is dependent on architecture. The old method is to use
kstop_machine to prevent races with the CPUs executing the code being
modified (which can cause the CPU to do undesirable things, especially
if the modified code crosses cache (or page) boundaries), and the nops are
patched back to calls. But this time, they do not call mcount
(which is just a function stub). They now call into the ftrace
infrastructure.

The new method of modifying the function tracepoints is to place
a breakpoint at the location to be modified, sync all CPUs, and
modify the rest of the instruction not covered by the breakpoint.
The CPUs are synced again, and then the breakpoint is removed by
replacing it with the finished version of the ftrace call site.

Some archs do not even need to monkey around with the synchronization,
and can just slap the new code on top of the old without any
problems with other CPUs executing it at the same time.

One special side-effect of the recording of the functions being
traced is that we can now selectively choose which functions we
wish to trace and which ones we want the mcount calls to remain
as nops.

Two files are used, one for enabling and one for disabling the
tracing of specified functions. They are:

	set_ftrace_filter

and

	set_ftrace_notrace

The available functions that you can add to these files are
listed in:

	available_filter_functions

::

  # cat available_filter_functions
  put_prev_task_idle
  kmem_cache_create
  pick_next_task_rt
  get_online_cpus
  pick_next_task_fair
  mutex_lock
  [...]

If I am only interested in sys_nanosleep and hrtimer_interrupt::

  # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
  # echo function > current_tracer
  # echo 1 > tracing_on
  # usleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 5/5   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
          usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
          <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
          usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
          <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
          <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt

To see which functions are being traced, you can cat the file:
::

  # cat set_ftrace_filter
  hrtimer_interrupt
  sys_nanosleep


Perhaps this is not enough. The filters also allow glob(7) matching.

  ``<match>*``
	will match functions that begin with <match>
  ``*<match>``
	will match functions that end with <match>
  ``*<match>*``
	will match functions that have <match> in them
  ``<match1>*<match2>``
	will match functions that begin with <match1> and end with <match2>

.. note::
   It is better to use quotes to enclose the wildcards,
   otherwise the shell may expand the parameters into names
   of files in the local directory.
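The four glob forms can be sketched with plain grep against a stand-in list of names. This is only an illustration of the matching semantics; ftrace does its own matching inside the kernel, and the function names below are a hypothetical sample, not taken from available_filter_functions::

```shell
# A stand-in list of function names; in real use these come from
# the available_filter_functions file in tracefs.
printf '%s\n' hrtimer_init hrtimer_cancel do_hrtimer_init sys_open > /tmp/funcs

# "hrtimer_*"  - functions that begin with hrtimer_
grep '^hrtimer_' /tmp/funcs

# "*_init"     - functions that end with _init
grep '_init$' /tmp/funcs

# "*hrtimer*"  - functions that have hrtimer in them
grep 'hrtimer' /tmp/funcs

# "do_*_init"  - functions that begin with do_ and end with _init
grep '^do_.*_init$' /tmp/funcs
```

Note how a leading or trailing '*' in the glob corresponds to dropping the '^' or '$' anchor in the equivalent regular expression.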

::

  # echo 'hrtimer_*' > set_ftrace_filter

Produces::

  # tracer: function
  #
  # entries-in-buffer/entries-written: 897/897   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
          <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
          <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
          <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
          <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
          <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
          <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
          <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem

Notice that we lost sys_nanosleep.
::

  # cat set_ftrace_filter
  hrtimer_run_queues
  hrtimer_run_pending
  hrtimer_init
  hrtimer_cancel
  hrtimer_try_to_cancel
  hrtimer_forward
  hrtimer_start
  hrtimer_reprogram
  hrtimer_force_reprogram
  hrtimer_get_next_event
  hrtimer_interrupt
  hrtimer_nanosleep
  hrtimer_wakeup
  hrtimer_get_remaining
  hrtimer_get_res
  hrtimer_init_sleeper


This is because the '>' and '>>' act just like they do in bash.
To rewrite the filters, use '>'.
To append to the filters, use '>>'.

To clear out a filter so that all functions will be recorded
again::

  # echo > set_ftrace_filter
  # cat set_ftrace_filter
  #

Again, now we want to append.

::

  # echo sys_nanosleep > set_ftrace_filter
  # cat set_ftrace_filter
  sys_nanosleep
  # echo 'hrtimer_*' >> set_ftrace_filter
  # cat set_ftrace_filter
  hrtimer_run_queues
  hrtimer_run_pending
  hrtimer_init
  hrtimer_cancel
  hrtimer_try_to_cancel
  hrtimer_forward
  hrtimer_start
  hrtimer_reprogram
  hrtimer_force_reprogram
  hrtimer_get_next_event
  hrtimer_interrupt
  sys_nanosleep
  hrtimer_nanosleep
  hrtimer_wakeup
  hrtimer_get_remaining
  hrtimer_get_res
  hrtimer_init_sleeper


The set_ftrace_notrace file prevents the listed functions from
being traced.
::

  # echo '*preempt*' '*lock*' > set_ftrace_notrace

Produces::

  # tracer: function
  #
  # entries-in-buffer/entries-written: 39608/39608   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
            bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
            bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
            bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
            bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
            bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
            bash-1994  [000] ....  4342.324899: do_truncate <-do_last
            bash-1994  [000] ....  4342.324899: setattr_should_drop_suidgid <-do_truncate
            bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
            bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
            bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
            bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time

We can see that there's no more lock or preempt tracing.
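Because '>' and '>>' behave exactly as they do in bash, the rewrite-versus-append behavior can be sketched with an ordinary file standing in for set_ftrace_filter (the file name here is arbitrary):

```shell
f=/tmp/filter_demo

echo sys_nanosleep > "$f"       # '>' rewrites: the file holds only this name
echo hrtimer_interrupt >> "$f"  # '>>' appends: both names are now kept
cat "$f"

echo schedule > "$f"            # '>' again: the previous contents are discarded
cat "$f"
```

The same rule applies to set_ftrace_notrace and, as shown later, to set_graph_function.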

Selecting function filters via index
------------------------------------

Because processing of strings is expensive (the address of the function
needs to be looked up before comparing to the string being passed in),
an index can be used as well to enable functions. This is useful in the
case of setting thousands of specific functions at a time. By passing
in a list of numbers, no string processing will occur. Instead, the function
at the specific location in the internal array (which corresponds to the
functions in the "available_filter_functions" file) is selected.

::

  # echo 1 > set_ftrace_filter

Will select the first function listed in "available_filter_functions".

::

  # head -1 available_filter_functions
  trace_initcall_finish_cb

  # cat set_ftrace_filter
  trace_initcall_finish_cb

  # head -50 available_filter_functions | tail -1
  x86_pmu_commit_txn

  # echo 1 50 > set_ftrace_filter
  # cat set_ftrace_filter
  trace_initcall_finish_cb
  x86_pmu_commit_txn

Dynamic ftrace with the function graph tracer
---------------------------------------------

Although what has been explained above concerns both the
function tracer and the function-graph tracer, there are some
special features only available in the function-graph tracer.

If you want to trace only one function and all of its children,
you just have to echo its name into set_graph_function::

	echo __do_fault > set_graph_function

will produce the following "expanded" trace of the __do_fault()
function::

   0)               |  __do_fault() {
   0)               |    filemap_fault() {
   0)               |      find_lock_page() {
   0)   0.804 us    |        find_get_page();
   0)               |        __might_sleep() {
   0)   1.329 us    |        }
   0)   3.904 us    |      }
   0)   4.979 us    |    }
   0)   0.653 us    |    _spin_lock();
   0)   0.578 us    |    page_add_file_rmap();
   0)   0.525 us    |    native_set_pte_at();
   0)   0.585 us    |    _spin_unlock();
   0)               |    unlock_page() {
   0)   0.541 us    |      page_waitqueue();
   0)   0.639 us    |      __wake_up_bit();
   0)   2.786 us    |    }
   0) + 14.237 us   |  }
   0)               |  __do_fault() {
   0)               |    filemap_fault() {
   0)               |      find_lock_page() {
   0)   0.698 us    |        find_get_page();
   0)               |        __might_sleep() {
   0)   1.412 us    |        }
   0)   3.950 us    |      }
   0)   5.098 us    |    }
   0)   0.631 us    |    _spin_lock();
   0)   0.571 us    |    page_add_file_rmap();
   0)   0.526 us    |    native_set_pte_at();
   0)   0.586 us    |    _spin_unlock();
   0)               |    unlock_page() {
   0)   0.533 us    |      page_waitqueue();
   0)   0.638 us    |      __wake_up_bit();
   0)   2.793 us    |    }
   0) + 14.012 us   |  }

You can also expand several functions at once::

	echo sys_open > set_graph_function
	echo sys_close >> set_graph_function

Now if you want to go back to trace all functions you can clear
this special filter via::

	echo > set_graph_function


ftrace_enabled
--------------

Note, the proc sysctl ftrace_enabled is a big on/off switch for the
function tracer. By default it is enabled (when function tracing is
enabled in the kernel). If it is disabled, all function tracing is
disabled. This includes not only the function tracers for ftrace, but
also any other uses (perf, kprobes, stack tracing, profiling, etc). It
cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set
registered.

Please disable this with care.

This can be disabled (and enabled) with::

	sysctl kernel.ftrace_enabled=0
	sysctl kernel.ftrace_enabled=1

 or

	echo 0 > /proc/sys/kernel/ftrace_enabled
	echo 1 > /proc/sys/kernel/ftrace_enabled


Filter commands
---------------

A few commands are supported by the set_ftrace_filter interface.
Trace commands have the following format::

	<function>:<command>:<parameter>

The following commands are supported:

- mod:
  This command enables function filtering per module. The
  parameter defines the module. For example, if only the write*
  functions in the ext3 module are desired, run::

	echo 'write*:mod:ext3' > set_ftrace_filter

  This command interacts with the filter in the same way as
  filtering based on function names. Thus, adding more functions
  in a different module is accomplished by appending (>>) to the
  filter file. Remove specific module functions by prepending
  '!'::

	echo '!writeback*:mod:ext3' >> set_ftrace_filter

  The mod command supports module globbing. Disable tracing for all
  functions except a specific module::

	echo '!*:mod:!ext3' >> set_ftrace_filter

  Disable tracing for all modules, but still trace the kernel::

	echo '!*:mod:*' >> set_ftrace_filter

  Enable the filter only for the kernel::

	echo '*write*:mod:!*' >> set_ftrace_filter

  Enable the filter for module globbing::

	echo '*write*:mod:*snd*' >> set_ftrace_filter

- traceon/traceoff:
  These commands turn tracing on and off when the specified
  functions are hit. The parameter determines how many times the
  tracing system is turned on and off. If unspecified, there is
  no limit. For example, to disable tracing when a schedule bug
  is hit the first 5 times, run::

	echo '__schedule_bug:traceoff:5' > set_ftrace_filter

  To always disable tracing when __schedule_bug is hit::

	echo '__schedule_bug:traceoff' > set_ftrace_filter

  These commands are cumulative whether or not they are appended
  to set_ftrace_filter. To remove a command, prepend it with '!'
  and drop the parameter::

	echo '!__schedule_bug:traceoff:0' > set_ftrace_filter

  The above removes the traceoff command for __schedule_bug
  that has a counter. To remove commands without counters::

	echo '!__schedule_bug:traceoff' > set_ftrace_filter

- snapshot:
  Will cause a snapshot to be triggered when the function is hit.
  ::

	echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter

  To only snapshot once:
  ::

	echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter

  To remove the above commands::

	echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
	echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter

- enable_event/disable_event:
  These commands can enable or disable a trace event. Note, because
  function tracing callbacks are very sensitive, when these commands
  are registered, the trace point is activated, but disabled in
  a "soft" mode. That is, the tracepoint will be called, but
  just will not be traced. The event tracepoint stays in this mode
  as long as there's a command that triggers it.
  ::

	echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
		set_ftrace_filter

  The format is::

	<function>:enable_event:<system>:<event>[:count]
	<function>:disable_event:<system>:<event>[:count]

  To remove the events commands::

	echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
		set_ftrace_filter
	echo '!schedule:disable_event:sched:sched_switch' > \
		set_ftrace_filter

- dump:
  When the function is hit, it will dump the contents of the ftrace
  ring buffer to the console. This is useful if you need to debug
  something, and want to dump the trace when a certain function
  is hit. Perhaps it's a function that is called before a triple
  fault happens and does not allow you to get a regular dump.

- cpudump:
  When the function is hit, it will dump the contents of the ftrace
  ring buffer for the current CPU to the console. Unlike the "dump"
  command, it only prints out the contents of the ring buffer for the
  CPU that executed the function that triggered the dump.

- stacktrace:
  When the function is hit, a stack trace is recorded.

trace_pipe
----------

The trace_pipe outputs the same content as the trace file, but
the effect on the tracing is different. Every read from
trace_pipe is consumed. This means that subsequent reads will be
different. The trace is live.
::

  # echo function > current_tracer
  # cat trace_pipe > /tmp/trace.out &
  [1] 4153
  # echo 1 > tracing_on
  # usleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 0/0   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |

  #
  # cat /tmp/trace.out
            bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
            bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
            bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
            bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
            bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
            bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
            bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
            bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
            bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath


Note, reading the trace_pipe file will block until more input is
added. This is in contrast to the trace file. If any process opened
the trace file for reading, it will actually disable tracing and
prevent new entries from being added. The trace_pipe file does
not have this limitation.

trace entries
-------------

Having too much or not enough data can be troublesome in
diagnosing an issue in the kernel. The file buffer_size_kb is
used to modify the size of the internal trace buffers. The
number listed is the size, in kilobytes, of one per-CPU buffer.
To know the full size, multiply the number of possible CPUs
by this number.
::

  # cat buffer_size_kb
  1408 (units kilobytes)

Or simply read buffer_total_size_kb
::

  # cat buffer_total_size_kb
  5632

To modify the buffer, simply echo in a number (in 1024 byte segments).
::

  # echo 10000 > buffer_size_kb
  # cat buffer_size_kb
  10000 (units kilobytes)

It will try to allocate as much as possible. If you allocate too
much, it can trigger the Out-Of-Memory killer.
::

  # echo 1000000000000 > buffer_size_kb
  -bash: echo: write error: Cannot allocate memory
  # cat buffer_size_kb
  85

The per_cpu buffers can be changed individually as well:
::

  # echo 10000 > per_cpu/cpu0/buffer_size_kb
  # echo 100 > per_cpu/cpu1/buffer_size_kb

When the per_cpu buffers are not the same, the buffer_size_kb
at the top level will just show an X
::

  # cat buffer_size_kb
  X

This is where the buffer_total_size_kb is useful:
::

  # cat buffer_total_size_kb
  12916

Writing to the top level buffer_size_kb will reset all the buffers
to be the same again.

Snapshot
--------
CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
available to all non latency tracers. (Latency tracers which
record max latency, such as "irqsoff" or "wakeup", can't use
this feature, since those are already using the snapshot
mechanism internally.)

Snapshot preserves a current trace buffer at a particular point
in time without stopping tracing. Ftrace swaps the current
buffer with a spare buffer, and tracing continues in the new
current (=previous spare) buffer.

The following tracefs files in "tracing" are related to this
feature:

  snapshot:

	This is used to take a snapshot and to read the output
	of the snapshot. Echo 1 into this file to allocate a
	spare buffer and to take a snapshot (swap), then read
	the snapshot from this file in the same format as
	"trace" (described above in the section "The File
	System"). Reading the snapshot and tracing can be done
	in parallel. When the spare buffer is allocated, echoing
	0 frees it, and echoing other (positive) values clears
	the snapshot contents.
	More details are shown in the table below.

	+--------------+------------+------------+------------+
	|status\\input |     0      |     1      |    else    |
	+==============+============+============+============+
	|not allocated |(do nothing)| alloc+swap |(do nothing)|
	+--------------+------------+------------+------------+
	|allocated     |    free    |    swap    |   clear    |
	+--------------+------------+------------+------------+

Here is an example of using the snapshot feature.
::

  # echo 1 > events/sched/enable
  # echo 1 > snapshot
  # cat snapshot
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 71/71   #P:8
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
             sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
  [...]
            <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120

  # cat trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 77/77   #P:8
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
   snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
  [...]


If you try to use this snapshot feature when current tracer is
one of the latency tracers, you will get the following results.
::

  # echo wakeup > current_tracer
  # echo 1 > snapshot
  bash: echo: write error: Device or resource busy
  # cat snapshot
  cat: snapshot: Device or resource busy


Instances
---------
In the tracefs tracing directory is a directory called "instances".
New directories can be created inside of it using mkdir, and removed
with rmdir. A directory created there with mkdir will already contain
files and other directories after it is created.
::

  # mkdir instances/foo
  # ls instances/foo
  buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
  set_event  snapshot  trace  trace_clock  trace_marker  trace_options
  trace_pipe  tracing_on

As you can see, the new directory looks similar to the tracing directory
itself. In fact, it is very similar, except that the buffer and
events are independent of the main directory, and of any other
instances that are created.
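A guarded sketch of creating an instance, enabling an event class inside it, and removing it again. The instance name "demo" is arbitrary, and the script skips cleanly when tracefs is not mounted or not writable:

```shell
TRACEFS=/sys/kernel/tracing

# tracefs is usually only writable by root, and may not be mounted
# at all (e.g. inside a container), so bail out politely.
if [ ! -w "$TRACEFS/instances" ]; then
    echo "skipping: tracefs not available"
    exit 0
fi

# Create an instance with its own ring buffer, enable the sched
# events in it, peek at the buffer, then tear everything down.
mkdir "$TRACEFS/instances/demo"
echo 1 > "$TRACEFS/instances/demo/events/sched/enable"
head -20 "$TRACEFS/instances/demo/trace"
echo 0 > "$TRACEFS/instances/demo/events/sched/enable"
rmdir "$TRACEFS/instances/demo"
```

Remember that the rmdir at the end fails with EBUSY if any process still has a file from the instance open.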

The files in the new directory work just like the files with the
same name in the tracing directory except the buffer that is used
is a separate and new buffer. The files affect that buffer but do not
affect the main buffer with the exception of trace_options. Currently,
the trace_options affect all instances and the top level buffer
the same, but this may change in future releases. That is, options
may become specific to the instance they reside in.

Notice that none of the function tracer files are there, nor are
current_tracer and available_tracers. This is because the buffers
can currently only have events enabled for them.
::

  # mkdir instances/foo
  # mkdir instances/bar
  # mkdir instances/zoot
  # echo 100000 > buffer_size_kb
  # echo 1000 > instances/foo/buffer_size_kb
  # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
  # echo function > current_tracer
  # echo 1 > instances/foo/events/sched/sched_wakeup/enable
  # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
  # echo 1 > instances/foo/events/sched/sched_switch/enable
  # echo 1 > instances/bar/events/irq/enable
  # echo 1 > instances/zoot/events/syscalls/enable
  # cat trace_pipe
  CPU:2 [LOST 11745 EVENTS]
              bash-2044  [002] ....  10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
              bash-2044  [002] d...  10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
              bash-2044  [002] d..1  10594.481032: __rmqueue <-get_page_from_freelist
              bash-2044  [002] d..1  10594.481033: _raw_spin_unlock <-get_page_from_freelist
              bash-2044  [002] d..1  10594.481033: sub_preempt_count <-_raw_spin_unlock
              bash-2044  [002] d...  10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
              bash-2044  [002] d...  10594.481034: __mod_zone_page_state <-get_page_from_freelist
              bash-2044  [002] d...  10594.481034: zone_statistics <-get_page_from_freelist
              bash-2044  [002] d...  10594.481034: __inc_zone_state <-zone_statistics
              bash-2044  [002] d...  10594.481034: __inc_zone_state <-zone_statistics
              bash-2044  [002] ....  10594.481035: arch_dup_task_struct <-copy_process
  [...]

  # cat instances/foo/trace_pipe
              bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
              bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
            <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
            <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
       rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
              bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
              bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
              bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
       kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
       kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
  [...]

  # cat instances/bar/trace_pipe
       migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
            <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
              bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
              bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
              bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
              bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
              bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
              bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
              sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
              sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
              sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
              sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
  [...]

  # cat instances/zoot/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 18996/18996   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
              bash-1998  [000] d...   140.733501: sys_write -> 0x2
              bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
              bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
              bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
              bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
              bash-1998  [000] d...   140.733510: sys_close(fd: a)
              bash-1998  [000] d...   140.733510: sys_close -> 0x0
              bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
              bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
              bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
              bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0

You can see that the trace of the topmost trace buffer shows only
the function tracing. The foo instance displays wakeups and task
switches.

To remove the instances, simply delete their directories:
::

  # rmdir instances/foo
  # rmdir instances/bar
  # rmdir instances/zoot

Note, if a process has a trace file open in one of the instance
directories, the rmdir will fail with EBUSY.


Stack trace
-----------
Since the kernel has a fixed sized stack, it is important not to
waste it in functions. A kernel developer must be conscious of
what they allocate on the stack. If they add too much, the system
can be in danger of a stack overflow, and corruption will occur,
usually leading to a system panic.

There are some tools that check this, usually with interrupts
periodically checking usage. But if you can perform a check
at every function call, that is very useful. As ftrace provides
a function tracer, it makes it convenient to check the stack size
at every function call. This is enabled via the stack tracer.

CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
::

  # echo 1 > /proc/sys/kernel/stack_tracer_enabled

You can also enable it from the kernel command line to trace
the stack size of the kernel during boot up, by adding "stacktrace"
to the kernel command line parameter.
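Turning the stack tracer on and off can be wrapped in a guard so it skips cleanly on kernels built without CONFIG_STACK_TRACER (or when not running as root). A minimal sketch:

```shell
KNOB=/proc/sys/kernel/stack_tracer_enabled

# The knob only exists with CONFIG_STACK_TRACER, and only root
# can write to it.
if [ ! -w "$KNOB" ]; then
    echo "skipping: stack tracer not available"
    exit 0
fi

echo 1 > "$KNOB"    # check stack usage at every traced function call
sleep 1             # let it observe some activity
cat /sys/kernel/tracing/stack_max_size
echo 0 > "$KNOB"    # turn it back off
```

The longer the tracer runs, the deeper the worst-case stack it is likely to observe, so in practice you would leave it enabled for more than a second.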

After running it for a few minutes, the output looks like:
::

  # cat stack_max_size
  2928

  # cat stack_trace
          Depth    Size   Location    (18 entries)
          -----    ----   --------
    0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
    1)     2704     160   find_busiest_group+0x31/0x1f1
    2)     2544     256   load_balance+0xd9/0x662
    3)     2288      80   idle_balance+0xbb/0x130
    4)     2208     128   __schedule+0x26e/0x5b9
    5)     2080      16   schedule+0x64/0x66
    6)     2064     128   schedule_timeout+0x34/0xe0
    7)     1936     112   wait_for_common+0x97/0xf1
    8)     1824      16   wait_for_completion+0x1d/0x1f
    9)     1808     128   flush_work+0xfe/0x119
   10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
   11)     1664      48   input_available_p+0x1d/0x5c
   12)     1616      48   n_tty_poll+0x6d/0x134
   13)     1568      64   tty_poll+0x64/0x7f
   14)     1504     880   do_select+0x31e/0x511
   15)      624     400   core_sys_select+0x177/0x216
   16)      224      96   sys_select+0x91/0xb9
   17)      128     128   system_call_fastpath+0x16/0x1b

Note, if -mfentry is being used by gcc, functions get traced before
they set up the stack frame. This means that leaf level functions
are not tested by the stack tracer when -mfentry is used.

Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.

More
----
More details can be found in the source code, in the
`kernel/trace/*.c` files.