1perf-intel-pt(1) 2================ 3 4NAME 5---- 6perf-intel-pt - Support for Intel Processor Trace within perf tools 7 8SYNOPSIS 9-------- 10[verse] 11'perf record' -e intel_pt// 12 13DESCRIPTION 14----------- 15 16Intel Processor Trace (Intel PT) is an extension of Intel Architecture that 17collects information about software execution such as control flow, execution 18modes and timings and formats it into highly compressed binary packets. 19Technical details are documented in the Intel 64 and IA-32 Architectures 20Software Developer Manuals, Chapter 36 Intel Processor Trace. 21 22Intel PT is first supported in Intel Core M and 5th generation Intel Core 23processors that are based on the Intel micro-architecture code name Broadwell. 24 25Trace data is collected by 'perf record' and stored within the perf.data file. 26See below for options to 'perf record'. 27 28Trace data must be 'decoded' which involves walking the object code and matching 29the trace data packets. For example a TNT packet only tells whether a 30conditional branch was taken or not taken, so to make use of that packet the 31decoder must know precisely which instruction was being executed. 32 33Decoding is done on-the-fly. The decoder outputs samples in the same format as 34samples output by perf hardware events, for example as though the "instructions" 35or "branches" events had been recorded. Presently 3 tools support this: 36'perf script', 'perf report' and 'perf inject'. See below for more information 37on using those tools. 38 39The main distinguishing feature of Intel PT is that the decoder can determine 40the exact flow of software execution. Intel PT can be used to understand why 41and how did software get to a certain point, or behave a certain way. The 42software does not have to be recompiled, so Intel PT works with debug or release 43builds, however the executed images are needed - which makes use in JIT-compiled 44environments, or with self-modified code, a challenge. Also symbols need to be 45provided to make sense of addresses. 46 47A limitation of Intel PT is that it produces huge amounts of trace data 48(hundreds of megabytes per second per core) which takes a long time to decode, 49for example two or three orders of magnitude longer than it took to collect. 50Another limitation is the performance impact of tracing, something that will 51vary depending on the use-case and architecture. 52 53 54Quickstart 55---------- 56 57It is important to start small. That is because it is easy to capture vastly 58more data than can possibly be processed. 59 60The simplest thing to do with Intel PT is userspace profiling of small programs. 61Data is captured with 'perf record' e.g. to trace 'ls' userspace-only: 62 63 perf record -e intel_pt//u ls 64 65And profiled with 'perf report' e.g. 66 67 perf report 68 69To also trace kernel space presents a problem, namely kernel self-modifying 70code. A fairly good kernel image is available in /proc/kcore but to get an 71accurate image a copy of /proc/kcore needs to be made under the same conditions 72as the data capture. 'perf record' can make a copy of /proc/kcore if the option 73--kcore is used, but access to /proc/kcore is restricted e.g. 74 75 sudo perf record -o pt_ls --kcore -e intel_pt// -- ls 76 77which will create a directory named 'pt_ls' and put the perf.data file (named 78simply 'data') and copies of /proc/kcore, /proc/kallsyms and /proc/modules into 79it. The other tools understand the directory format, so to use 'perf report' 80becomes: 81 82 sudo perf report -i pt_ls 83 84Because samples are synthesized after-the-fact, the sampling period can be 85selected for reporting. e.g. sample every microsecond 86 87 sudo perf report pt_ls --itrace=i1usge 88 89See the sections below for more information about the --itrace option. 90 91Beware the smaller the period, the more samples that are produced, and the 92longer it takes to process them. 93 94Also note that the coarseness of Intel PT timing information will start to 95distort the statistical value of the sampling as the sampling period becomes 96smaller. 97 98To represent software control flow, "branches" samples are produced. By default 99a branch sample is synthesized for every single branch. To get an idea what 100data is available you can use the 'perf script' tool with all itrace sampling 101options, which will list all the samples. 102 103 perf record -e intel_pt//u ls 104 perf script --itrace=ibxwpe 105 106An interesting field that is not printed by default is 'flags' which can be 107displayed as follows: 108 109 perf script --itrace=ibxwpe -F+flags 110 111The flags are "bcrosyiABExghDt" which stand for branch, call, return, conditional, 112system, asynchronous, interrupt, transaction abort, trace begin, trace end, 113in transaction, VM-entry, VM-exit, interrupt disabled, and interrupt disable 114toggle respectively. 115 116perf script also supports higher level ways to dump instruction traces: 117 118 perf script --insn-trace --xed 119 120Dump all instructions. This requires installing the xed tool (see XED below) 121Dumping all instructions in a long trace can be fairly slow. It is usually better 122to start with higher level decoding, like 123 124 perf script --call-trace 125 126or 127 128 perf script --call-ret-trace 129 130and then select a time range of interest. The time range can then be examined 131in detail with 132 133 perf script --time starttime,stoptime --insn-trace --xed 134 135While examining the trace it's also useful to filter on specific CPUs using 136the -C option 137 138 perf script --time starttime,stoptime --insn-trace --xed -C 1 139 140Dump all instructions in time range on CPU 1. 141 142Another interesting field that is not printed by default is 'ipc' which can be 143displayed as follows: 144 145 perf script --itrace=be -F+ipc 146 147There are two ways that instructions-per-cycle (IPC) can be calculated depending 148on the recording. 149 150If the 'cyc' config term (see config terms section below) was used, then IPC is 151calculated using the cycle count from CYC packets, otherwise MTC packets are 152used - refer to the 'mtc' config term. When MTC is used, however, the values 153are less accurate because the timing is less accurate. 154 155Because Intel PT does not update the cycle count on every branch or instruction, 156the values will often be zero. When there are values, they will be the number 157of instructions and number of cycles since the last update, and thus represent 158the average IPC since the last IPC for that event type. Note IPC for "branches" 159events is calculated separately from IPC for "instructions" events. 160 161Also note that the IPC instruction count may or may not include the current 162instruction. If the cycle count is associated with an asynchronous branch 163(e.g. page fault or interrupt), then the instruction count does not include the 164current instruction, otherwise it does. That is consistent with whether or not 165that instruction has retired when the cycle count is updated. 166 167Another note, in the case of "branches" events, non-taken branches are not 168presently sampled, so IPC values for them do not appear e.g. a CYC packet with a 169TNT packet that starts with a non-taken branch. To see every possible IPC 170value, "instructions" events can be used e.g. --itrace=i0ns 171 172While it is possible to create scripts to analyze the data, an alternative 173approach is available to export the data to a sqlite or postgresql database. 174Refer to script export-to-sqlite.py or export-to-postgresql.py for more details, 175and to script exported-sql-viewer.py for an example of using the database. 176 177There is also script intel-pt-events.py which provides an example of how to 178unpack the raw data for power events and PTWRITE. The script also displays 179branches, and supports 2 additional modes selected by option: 180 181 --insn-trace - instruction trace 182 --src-trace - source trace 183 184As mentioned above, it is easy to capture too much data. One way to limit the 185data captured is to use 'snapshot' mode which is explained further below. 186Refer to 'new snapshot option' and 'Intel PT modes of operation' further below. 187 188Another problem that will be experienced is decoder errors. They can be caused 189by inability to access the executed image, self-modified or JIT-ed code, or the 190inability to match side-band information (such as context switches and mmaps) 191which results in the decoder not knowing what code was executed. 192 193There is also the problem of perf not being able to copy the data fast enough, 194resulting in data lost because the buffer was full. See 'Buffer handling' below 195for more details. 196 197 198perf record 199----------- 200 201new event 202~~~~~~~~~ 203 204The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are 205selected by providing the PMU name followed by the "config" separated by slashes. 206An enhancement has been made to allow default "config" e.g. the option 207 208 -e intel_pt// 209 210will use a default config value. Currently that is the same as 211 212 -e intel_pt/tsc,noretcomp=0/ 213 214which is the same as 215 216 -e intel_pt/tsc=1,noretcomp=0/ 217 218Note there are now new config terms - see section 'config terms' further below. 219 220The config terms are listed in /sys/devices/intel_pt/format. They are bit 221fields within the config member of the struct perf_event_attr which is 222passed to the kernel by the perf_event_open system call. They correspond to bit 223fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions: 224 225 $ grep -H . /sys/bus/event_source/devices/intel_pt/format/* 226 /sys/bus/event_source/devices/intel_pt/format/cyc:config:1 227 /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22 228 /sys/bus/event_source/devices/intel_pt/format/mtc:config:9 229 /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17 230 /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11 231 /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27 232 /sys/bus/event_source/devices/intel_pt/format/tsc:config:10 233 234Note that the default config must be overridden for each term i.e. 235 236 -e intel_pt/noretcomp=0/ 237 238is the same as: 239 240 -e intel_pt/tsc=1,noretcomp=0/ 241 242So, to disable TSC packets use: 243 244 -e intel_pt/tsc=0/ 245 246It is also possible to specify the config value explicitly: 247 248 -e intel_pt/config=0x400/ 249 250Note that, as with all events, the event is suffixed with event modifiers: 251 252 u userspace 253 k kernel 254 h hypervisor 255 G guest 256 H host 257 p precise ip 258 259'h', 'G' and 'H' are for virtualization which is not supported by Intel PT. 260'p' is also not relevant to Intel PT. So only options 'u' and 'k' are 261meaningful for Intel PT. 262 263perf_event_attr is displayed if the -vv option is used e.g. 264 265 ------------------------------------------------------------ 266 perf_event_attr: 267 type 6 268 size 112 269 config 0x400 270 { sample_period, sample_freq } 1 271 sample_type IP|TID|TIME|CPU|IDENTIFIER 272 read_format ID 273 disabled 1 274 inherit 1 275 exclude_kernel 1 276 exclude_hv 1 277 enable_on_exec 1 278 sample_id_all 1 279 ------------------------------------------------------------ 280 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 281 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 282 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 283 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 284 ------------------------------------------------------------ 285 286 287config terms 288~~~~~~~~~~~~ 289 290The June 2015 version of Intel 64 and IA-32 Architectures Software Developer 291Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features. 292Some of the features are reflect in new config terms. All the config terms are 293described below. 294 295tsc Always supported. Produces TSC timestamp packets to provide 296 timing information. In some cases it is possible to decode 297 without timing information, for example a per-thread context 298 that does not overlap executable memory maps. 299 300 The default config selects tsc (i.e. tsc=1). 301 302noretcomp Always supported. Disables "return compression" so a TIP packet 303 is produced when a function returns. Causes more packets to be 304 produced but might make decoding more reliable. 305 306 The default config does not select noretcomp (i.e. noretcomp=0). 307 308psb_period Allows the frequency of PSB packets to be specified. 309 310 The PSB packet is a synchronization packet that provides a 311 starting point for decoding or recovery from errors. 312 313 Support for psb_period is indicated by: 314 315 /sys/bus/event_source/devices/intel_pt/caps/psb_cyc 316 317 which contains "1" if the feature is supported and "0" 318 otherwise. 319 320 Valid values are given by: 321 322 /sys/bus/event_source/devices/intel_pt/caps/psb_periods 323 324 which contains a hexadecimal value, the bits of which represent 325 valid values e.g. bit 2 set means value 2 is valid. 326 327 The psb_period value is converted to the approximate number of 328 trace bytes between PSB packets as: 329 330 2 ^ (value + 11) 331 332 e.g. value 3 means 16KiB bytes between PSBs 333 334 If an invalid value is entered, the error message 335 will give a list of valid values e.g. 336 337 $ perf record -e intel_pt/psb_period=15/u uname 338 Invalid psb_period for intel_pt. Valid values are: 0-5 339 340 If MTC packets are selected, the default config selects a value 341 of 3 (i.e. psb_period=3) or the nearest lower value that is 342 supported (0 is always supported). Otherwise the default is 0. 343 344 If decoding is expected to be reliable and the buffer is large 345 then a large PSB period can be used. 346 347 Because a TSC packet is produced with PSB, the PSB period can 348 also affect the granularity to timing information in the absence 349 of MTC or CYC. 350 351mtc Produces MTC timing packets. 352 353 MTC packets provide finer grain timestamp information than TSC 354 packets. MTC packets record time using the hardware crystal 355 clock (CTC) which is related to TSC packets using a TMA packet. 356 357 Support for this feature is indicated by: 358 359 /sys/bus/event_source/devices/intel_pt/caps/mtc 360 361 which contains "1" if the feature is supported and 362 "0" otherwise. 363 364 The frequency of MTC packets can also be specified - see 365 mtc_period below. 366 367mtc_period Specifies how frequently MTC packets are produced - see mtc 368 above for how to determine if MTC packets are supported. 369 370 Valid values are given by: 371 372 /sys/bus/event_source/devices/intel_pt/caps/mtc_periods 373 374 which contains a hexadecimal value, the bits of which represent 375 valid values e.g. bit 2 set means value 2 is valid. 376 377 The mtc_period value is converted to the MTC frequency as: 378 379 CTC-frequency / (2 ^ value) 380 381 e.g. value 3 means one eighth of CTC-frequency 382 383 Where CTC is the hardware crystal clock, the frequency of which 384 can be related to TSC via values provided in cpuid leaf 0x15. 385 386 If an invalid value is entered, the error message 387 will give a list of valid values e.g. 388 389 $ perf record -e intel_pt/mtc_period=15/u uname 390 Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9 391 392 The default value is 3 or the nearest lower value 393 that is supported (0 is always supported). 394 395cyc Produces CYC timing packets. 396 397 CYC packets provide even finer grain timestamp information than 398 MTC and TSC packets. A CYC packet contains the number of CPU 399 cycles since the last CYC packet. Unlike MTC and TSC packets, 400 CYC packets are only sent when another packet is also sent. 401 402 Support for this feature is indicated by: 403 404 /sys/bus/event_source/devices/intel_pt/caps/psb_cyc 405 406 which contains "1" if the feature is supported and 407 "0" otherwise. 408 409 The number of CYC packets produced can be reduced by specifying 410 a threshold - see cyc_thresh below. 411 412cyc_thresh Specifies how frequently CYC packets are produced - see cyc 413 above for how to determine if CYC packets are supported. 414 415 Valid cyc_thresh values are given by: 416 417 /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds 418 419 which contains a hexadecimal value, the bits of which represent 420 valid values e.g. bit 2 set means value 2 is valid. 421 422 The cyc_thresh value represents the minimum number of CPU cycles 423 that must have passed before a CYC packet can be sent. The 424 number of CPU cycles is: 425 426 2 ^ (value - 1) 427 428 e.g. value 4 means 8 CPU cycles must pass before a CYC packet 429 can be sent. Note a CYC packet is still only sent when another 430 packet is sent, not at, e.g. every 8 CPU cycles. 431 432 If an invalid value is entered, the error message 433 will give a list of valid values e.g. 434 435 $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname 436 Invalid cyc_thresh for intel_pt. Valid values are: 0-12 437 438 CYC packets are not requested by default. 439 440pt Specifies pass-through which enables the 'branch' config term. 441 442 The default config selects 'pt' if it is available, so a user will 443 never need to specify this term. 444 445branch Enable branch tracing. Branch tracing is enabled by default so to 446 disable branch tracing use 'branch=0'. 447 448 The default config selects 'branch' if it is available. 449 450ptw Enable PTWRITE packets which are produced when a ptwrite instruction 451 is executed. 452 453 Support for this feature is indicated by: 454 455 /sys/bus/event_source/devices/intel_pt/caps/ptwrite 456 457 which contains "1" if the feature is supported and 458 "0" otherwise. 459 460 As an alternative, refer to "Emulated PTWRITE" further below. 461 462fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP packet 463 provides the address of the ptwrite instruction. In the absence of 464 fup_on_ptw, the decoder will use the address of the previous branch 465 if branch tracing is enabled, otherwise the address will be zero. 466 Note that fup_on_ptw will work even when branch tracing is disabled. 467 468pwr_evt Enable power events. The power events provide information about 469 changes to the CPU C-state. 470 471 Support for this feature is indicated by: 472 473 /sys/bus/event_source/devices/intel_pt/caps/power_event_trace 474 475 which contains "1" if the feature is supported and 476 "0" otherwise. 477 478event Enable Event Trace. The events provide information about asynchronous 479 events. 480 481 Support for this feature is indicated by: 482 483 /sys/bus/event_source/devices/intel_pt/caps/event_trace 484 485 which contains "1" if the feature is supported and 486 "0" otherwise. 487 488notnt Disable TNT packets. Without TNT packets, it is not possible to walk 489 executable code to reconstruct control flow, however FUP, TIP, TIP.PGE 490 and TIP.PGD packets still indicate asynchronous control flow, and (if 491 return compression is disabled - see noretcomp) return statements. 492 The advantage of eliminating TNT packets is reducing the size of the 493 trace and corresponding tracing overhead. 494 495 Support for this feature is indicated by: 496 497 /sys/bus/event_source/devices/intel_pt/caps/tnt_disable 498 499 which contains "1" if the feature is supported and 500 "0" otherwise. 501 502 503AUX area sampling option 504~~~~~~~~~~~~~~~~~~~~~~~~ 505 506To select Intel PT "sampling" the AUX area sampling option can be used: 507 508 --aux-sample 509 510Optionally it can be followed by the sample size in bytes e.g. 511 512 --aux-sample=8192 513 514In addition, the Intel PT event to sample must be defined e.g. 515 516 -e intel_pt//u 517 518Samples on other events will be created containing Intel PT data e.g. the 519following will create Intel PT samples on the branch-misses event, note the 520events must be grouped using {}: 521 522 perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' 523 524An alternative to '--aux-sample' is to add the config term 'aux-sample-size' to 525events. In this case, the grouping is implied e.g. 526 527 perf record -e intel_pt//u -e branch-misses/aux-sample-size=8192/u 528 529is the same as: 530 531 perf record -e '{intel_pt//u,branch-misses/aux-sample-size=8192/u}' 532 533but allows for also using an address filter e.g.: 534 535 perf record -e intel_pt//u --filter 'filter * @/bin/ls' -e branch-misses/aux-sample-size=8192/u -- ls 536 537It is important to select a sample size that is big enough to contain at least 538one PSB packet. If not a warning will be displayed: 539 540 Intel PT sample size (%zu) may be too small for PSB period (%zu) 541 542The calculation used for that is: if sample_size <= psb_period + 256 display the 543warning. When sampling is used, psb_period defaults to 0 (2KiB). 544 545The default sample size is 4KiB. 546 547The sample size is passed in aux_sample_size in struct perf_event_attr. The 548sample size is limited by the maximum event size which is 64KiB. It is 549difficult to know how big the event might be without the trace sample attached, 550but the tool validates that the sample size is not greater than 60KiB. 551 552 553new snapshot option 554~~~~~~~~~~~~~~~~~~~ 555 556The difference between full trace and snapshot from the kernel's perspective is 557that in full trace we don't overwrite trace data that the user hasn't collected 558yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let 559the trace run and overwrite older data in the buffer so that whenever something 560interesting happens, we can stop it and grab a snapshot of what was going on 561around that interesting moment. 562 563To select snapshot mode a new option has been added: 564 565 -S 566 567Optionally it can be followed by the snapshot size e.g. 568 569 -S0x100000 570 571The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size 572nor snapshot size is specified, then the default is 4MiB for privileged users 573(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. 574If an unprivileged user does not specify mmap pages, the mmap pages will be 575reduced as described in the 'new auxtrace mmap size option' section below. 576 577The snapshot size is displayed if the option -vv is used e.g. 578 579 Intel PT snapshot size: %zu 580 581 582new auxtrace mmap size option 583~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 584 585Intel PT buffer size is specified by an addition to the -m option e.g. 586 587 -m,16 588 589selects a buffer size of 16 pages i.e. 64KiB. 590 591Note that the existing functionality of -m is unchanged. The auxtrace mmap size 592is specified by the optional addition of a comma and the value. 593 594The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users 595(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. 596If an unprivileged user does not specify mmap pages, the mmap pages will be 597reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the 598user is likely to get an error as they exceed their mlock limit (Max locked 599memory as shown in /proc/self/limits). Note that perf does not count the first 600512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu 601against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus 602their mlock limit (which defaults to 64KiB but is not multiplied by the number 603of cpus). 604 605In full-trace mode, powers of two are allowed for buffer size, with a minimum 606size of 2 pages. In snapshot mode or sampling mode, it is the same but the 607minimum size is 1 page. 608 609The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g. 610 611 mmap length 528384 612 auxtrace mmap length 4198400 613 614 615Intel PT modes of operation 616~~~~~~~~~~~~~~~~~~~~~~~~~~~ 617 618Intel PT can be used in 3 modes: 619 full-trace mode 620 sample mode 621 snapshot mode 622 623Full-trace mode traces continuously e.g. 624 625 perf record -e intel_pt//u uname 626 627Sample mode attaches a Intel PT sample to other events e.g. 628 629 perf record --aux-sample -e intel_pt//u -e branch-misses:u 630 631Snapshot mode captures the available data when a signal is sent or "snapshot" 632control command is issued. e.g. using a signal 633 634 perf record -v -e intel_pt//u -S ./loopy 1000000000 & 635 [1] 11435 636 kill -USR2 11435 637 Recording AUX area tracing snapshot 638 639Note that the signal sent is SIGUSR2. 640Note that "Recording AUX area tracing snapshot" is displayed because the -v 641option is used. 642 643The advantage of using "snapshot" control command is that the access is 644controlled by access to a FIFO e.g. 645 646 $ mkfifo perf.control 647 $ mkfifo perf.ack 648 $ cat perf.ack & 649 [1] 15235 650 $ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 & 651 [2] 15243 652 $ ps -e | grep perf 653 15244 pts/1 00:00:00 perf 654 $ kill -USR2 15244 655 bash: kill: (15244) - Operation not permitted 656 $ echo snapshot > perf.control 657 ack 658 659The 3 Intel PT modes of operation cannot be used together. 660 661 662Buffer handling 663~~~~~~~~~~~~~~~ 664 665There may be buffer limitations (i.e. single ToPa entry) which means that actual 666buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to 667provide other sizes, and in particular an arbitrarily large size, multiple 668buffers are logically concatenated. However an interrupt must be used to switch 669between buffers. That has two potential problems: 670 a) the interrupt may not be handled in time so that the current buffer 671 becomes full and some trace data is lost. 672 b) the interrupts may slow the system and affect the performance 673 results. 674 675If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event 676which the tools report as an error. 677 678In full-trace mode, the driver waits for data to be copied out before allowing 679the (logical) buffer to wrap-around. If data is not copied out quickly enough, 680again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to 681wait, the intel_pt event gets disabled. Because it is difficult to know when 682that happens, perf tools always re-enable the intel_pt event after copying out 683data. 684 685 686Intel PT and build ids 687~~~~~~~~~~~~~~~~~~~~~~ 688 689By default "perf record" post-processes the event stream to find all build ids 690for executables for all addresses sampled. Deliberately, Intel PT is not 691decoded for that purpose (it would take too long). Instead the build ids for 692all executables encountered (due to mmap, comm or task events) are included 693in the perf.data file. 694 695To see buildids included in the perf.data file use the command: 696 697 perf buildid-list 698 699If the perf.data file contains Intel PT data, that is the same as: 700 701 perf buildid-list --with-hits 702 703 704Snapshot mode and event disabling 705~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 706 707In order to make a snapshot, the intel_pt event is disabled using an IOCTL, 708namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the 709collection of side-band information. In order to prevent that, a dummy 710software event has been introduced that permits tracking events (like mmaps) to 711continue to be recorded while intel_pt is disabled. That is important to ensure 712there is complete side-band information to allow the decoding of subsequent 713snapshots. 714 715A test has been created for that. To find the test: 716 717 perf test list 718 ... 719 23: Test using a dummy software event to keep tracking 720 721To run the test: 722 723 perf test 23 724 23: Test using a dummy software event to keep tracking : Ok 725 726 727perf record modes (nothing new here) 728~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 729 730perf record essentially operates in one of three modes: 731 per thread 732 per cpu 733 workload only 734 735"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a 736workload). 737"per cpu" is selected by -C or -a. 738"workload only" mode is selected by not using the other options but providing a 739command to run (i.e. the workload). 740 741In per-thread mode an exact list of threads is traced. There is no inheritance. 742Each thread has its own event buffer. 743 744In per-cpu mode all processes (or processes from the selected cgroup i.e. -G 745option, or processes selected with -p or -u) are traced. Each cpu has its own 746buffer. Inheritance is allowed. 747 748In workload-only mode, the workload is traced but with per-cpu buffers. 749Inheritance is allowed. Note that you can now trace a workload in per-thread 750mode by using the --per-thread option. 751 752 753Privileged vs non-privileged users 754~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 755 756Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users 757have memory limits imposed upon them. That affects what buffer sizes they can 758have as outlined above. 759 760The v4.2 kernel introduced support for a context switch metadata event, 761PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes 762are scheduled out and in, just not by whom, which is left for the 763PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide context, 764which in turn requires CAP_PERFMON or CAP_SYS_ADMIN. 765 766Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate context 767switches") commit, that introduces these metadata events for further info. 768 769When working with kernels < v4.2, the following considerations must be taken, 770as the sched:sched_switch tracepoints will be used to receive such information: 771 772Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are 773not permitted to use tracepoints which means there is insufficient side-band 774information to decode Intel PT in per-cpu mode, and potentially workload-only 775mode too if the workload creates new processes. 776 777Note also, that to use tracepoints, read-access to debugfs is required. So if 778debugfs is not mounted or the user does not have read-access, it will again not 779be possible to decode Intel PT in per-cpu mode. 780 781 782sched_switch tracepoint 783~~~~~~~~~~~~~~~~~~~~~~~ 784 785The sched_switch tracepoint is used to provide side-band data for Intel PT 786decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't 787available. 788 789The sched_switch events are automatically added. e.g. the second event shown 790below: 791 792 $ perf record -vv -e intel_pt//u uname 793 ------------------------------------------------------------ 794 perf_event_attr: 795 type 6 796 size 112 797 config 0x400 798 { sample_period, sample_freq } 1 799 sample_type IP|TID|TIME|CPU|IDENTIFIER 800 read_format ID 801 disabled 1 802 inherit 1 803 exclude_kernel 1 804 exclude_hv 1 805 enable_on_exec 1 806 sample_id_all 1 807 ------------------------------------------------------------ 808 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 809 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 810 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 811 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 812 ------------------------------------------------------------ 813 perf_event_attr: 814 type 2 815 size 112 816 config 0x108 817 { sample_period, sample_freq } 1 818 sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER 819 read_format ID 820 inherit 1 821 sample_id_all 1 822 exclude_guest 1 823 ------------------------------------------------------------ 824 sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 825 sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 826 sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 827 sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 828 ------------------------------------------------------------ 829 perf_event_attr: 830 type 1 831 size 112 832 config 0x9 833 { sample_period, sample_freq } 1 834 sample_type IP|TID|TIME|IDENTIFIER 835 read_format ID 836 disabled 1 837 inherit 1 838 exclude_kernel 1 839 exclude_hv 1 840 mmap 1 841 comm 1 842 enable_on_exec 1 843 task 1 844 sample_id_all 1 845 mmap2 1 846 comm_exec 1 847 ------------------------------------------------------------ 848 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 849 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 850 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 851 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 852 mmap size 528384B 853 AUX area mmap length 4194304 854 perf event ring buffer mmapped per cpu 855 Synthesizing auxtrace information 856 Linux 857 [ perf record: Woken up 1 times to write data ] 858 [ perf record: Captured and wrote 0.042 MB perf.data ] 859 860Note, the sched_switch event is only added if the user is permitted to use it 861and only in per-cpu mode. 862 863Note also, the sched_switch event is only added if TSC packets are requested. 864That is because, in the absence of timing information, the sched_switch events 865cannot be matched against the Intel PT trace. 866 867 868perf script 869----------- 870 871By default, perf script will decode trace data found in the perf.data file. 872This can be further controlled by new option --itrace. 873 874 875New --itrace option 876~~~~~~~~~~~~~~~~~~~ 877 878Having no option is the same as 879 880 --itrace 881 882which, in turn, is the same as 883 884 --itrace=cepwx 885 886The letters are: 887 888 i synthesize "instructions" events 889 b synthesize "branches" events 890 x synthesize "transactions" events 891 w synthesize "ptwrite" events 892 p synthesize "power" events (incl. PSB events) 893 c synthesize branches events (calls only) 894 r synthesize branches events (returns only) 895 o synthesize PEBS-via-PT events 896 I synthesize Event Trace events 897 e synthesize tracing error events 898 d create a debug log 899 g synthesize a call chain (use with i or x) 900 G synthesize a call chain on existing event records 901 l synthesize last branch entries (use with i or x) 902 L synthesize last branch entries on existing event records 903 s skip initial number of events 904 q quicker (less detailed) decoding 905 Z prefer to ignore timestamps (so-called "timeless" decoding) 906 907"Instructions" events look like they were recorded by "perf record -e 908instructions". 909 910"Branches" events look like they were recorded by "perf record -e branches". "c" 911and "r" can be combined to get calls and returns. 912 913"Transactions" events correspond to the start or end of transactions. The 914'flags' field can be used in perf script to determine whether the event is a 915transaction start, commit or abort. 916 917Note that "instructions", "branches" and "transactions" events depend on code 918flow packets which can be disabled by using the config term "branch=0". Refer 919to the config terms section above. 920 921"ptwrite" events record the payload of the ptwrite instruction and whether 922"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are 923recorded only if the "ptw" config term was used. Refer to the config terms 924section above. perf script "synth" field displays "ptwrite" information like 925this: "ip: 0 payload: 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was 926used. 927 928"Power" events correspond to power event packets and CBR (core-to-bus ratio) 929packets. While CBR packets are always recorded when tracing is enabled, power 930event packets are recorded only if the "pwr_evt" config term was used. Refer to 931the config terms section above. The power events record information about 932C-state changes, whereas CBR is indicative of CPU frequency. perf script 933"event,synth" fields display information like this: 934 cbr: cbr: 22 freq: 2189 MHz (200%) 935 mwait: hints: 0x60 extensions: 0x1 936 pwre: hw: 0 cstate: 2 sub-cstate: 0 937 exstop: ip: 1 938 pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 939Where: 940 "cbr" includes the frequency and the percentage of maximum non-turbo 941 "mwait" shows mwait hints and extensions 942 "pwre" shows C-state transitions (to a C-state deeper than C0) and 943 whether initiated by hardware 944 "exstop" indicates execution stopped and whether the IP was recorded 945 exactly, 946 "pwrx" indicates return to C0 947For more details refer to the Intel 64 and IA-32 Architectures Software 948Developer Manuals. 949 950PSB events show when a PSB+ occurred and also the byte-offset in the trace. 951Emitting a PSB+ can cause a CPU a slight delay. When doing timing analysis 952of code with Intel PT, it is useful to know if a timing bubble was caused 953by Intel PT or not. 954 955Error events show where the decoder lost the trace. Error events 956are quite important. Users must know if what they are seeing is a complete 957picture or not. The "e" option may be followed by flags which affect what errors 958will or will not be reported. Each flag must be preceded by either '+' or '-'. 959The flags supported by Intel PT are: 960 -o Suppress overflow errors 961 -l Suppress trace data lost errors 962For example, for errors but not overflow or data lost errors: 963 964 --itrace=e-o-l 965 966The "d" option will cause the creation of a file "intel_pt.log" containing all 967decoded packets and instructions. Note that this option slows down the decoder 968and that the resulting file may be very large. The "d" option may be followed 969by flags which affect what debug messages will or will not be logged. Each flag 970must be preceded by either '+' or '-'. The flags support by Intel PT are: 971 -a Suppress logging of perf events 972 +a Log all perf events 973By default, logged perf events are filtered by any specified time ranges, but 974flag +a overrides that. 975 976In addition, the period of the "instructions" event can be specified. e.g. 977 978 --itrace=i10us 979 980sets the period to 10us i.e. one instruction sample is synthesized for each 10 981microseconds of trace. Alternatives to "us" are "ms" (milliseconds), 982"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions). 983 984"ms", "us" and "ns" are converted to TSC ticks. 985 986The timing information included with Intel PT does not give the time of every 987instruction. Consequently, for the purpose of sampling, the decoder estimates 988the time since the last timing packet based on 1 tick per instruction. The time 989on the sample is *not* adjusted and reflects the last known value of TSC. 990 991For Intel PT, the default period is 100us. 992 993Setting it to a zero period means "as often as possible". 994 995In the case of Intel PT that is the same as a period of 1 and a unit of 996'instructions' (i.e. --itrace=i1i). 997 998Also the call chain size (default 16, max. 1024) for instructions or 999transactions events can be specified. e.g. 1000 1001 --itrace=ig32 1002 --itrace=xg32 1003 1004Also the number of last branch entries (default 64, max. 1024) for instructions or 1005transactions events can be specified. e.g. 1006 1007 --itrace=il10 1008 --itrace=xl10 1009 1010Note that last branch entries are cleared for each sample, so there is no overlap 1011from one sample to the next. 1012 1013The G and L options are designed in particular for sample mode, and work much 1014like g and l but add call chain and branch stack to the other selected events 1015instead of synthesized events. For example, to record branch-misses events for 1016'ls' and then add a call chain derived from the Intel PT trace: 1017 1018 perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' -- ls 1019 perf report --itrace=Ge 1020 1021Although in fact G is a default for perf report, so that is the same as just: 1022 1023 perf report 1024 1025One caveat with the G and L options is that they work poorly with "Large PEBS". 1026Large PEBS means PEBS records will be accumulated by hardware and the written 1027into the event buffer in one go. That reduces interrupts, but can give very 1028late timestamps. Because the Intel PT trace is synchronized by timestamps, 1029the PEBS events do not match the trace. Currently, Large PEBS is used only in 1030certain circumstances: 1031 - hardware supports it 1032 - PEBS is used 1033 - event period is specified, instead of frequency 1034 - the sample type is limited to the following flags: 1035 PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | 1036 PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | 1037 PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | 1038 PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | 1039 PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | 1040 PERF_SAMPLE_PERIOD (and sometimes) | PERF_SAMPLE_TIME 1041Because Intel PT sample mode uses a different sample type to the list above, 1042Large PEBS is not used with Intel PT sample mode. To avoid Large PEBS in other 1043cases, avoid specifying the event period i.e. avoid the 'perf record' -c option, 1044--count option, or 'period' config term. 1045 1046To disable trace decoding entirely, use the option --no-itrace. 1047 1048It is also possible to skip events generated (instructions, branches, transactions) 1049at the beginning. This is useful to ignore initialization code. 1050 1051 --itrace=i0nss1000000 1052 1053skips the first million instructions. 1054 1055The q option changes the way the trace is decoded. The decoding is much faster 1056but much less detailed. Specifically, with the q option, the decoder does not 1057decode TNT packets, and does not walk object code, but gets the ip from FUP and 1058TIP packets. The q option can be used with the b and i options but the period 1059is not used. The q option decodes more quickly, but is useful only if the 1060control flow of interest is represented or indicated by FUP, TIP, TIP.PGE, or 1061TIP.PGD packets (refer below). However the q option could be used to find time 1062ranges that could then be decoded fully using the --time option. 1063 1064What will *not* be decoded with the (single) q option: 1065 1066 - direct calls and jmps 1067 - conditional branches 1068 - non-branch instructions 1069 1070What *will* be decoded with the (single) q option: 1071 1072 - asynchronous branches such as interrupts 1073 - indirect branches 1074 - function return target address *if* the noretcomp config term (refer 1075 config terms section) was used 1076 - start of (control-flow) tracing 1077 - end of (control-flow) tracing, if it is not out of context 1078 - power events, ptwrite, transaction start and abort 1079 - instruction pointer associated with PSB packets 1080 1081Note the q option does not specify what events will be synthesized e.g. the p 1082option must be used also to show power events. 1083 1084Repeating the q option (double-q i.e. qq) results in even faster decoding and even 1085less detail. The decoder decodes only extended PSB (PSB+) packets, getting the 1086instruction pointer if there is a FUP packet within PSB+ (i.e. between PSB and 1087PSBEND). Note PSB packets occur regularly in the trace based on the psb_period 1088config term (refer config terms section). There will be a FUP packet if the 1089PSB+ occurs while control flow is being traced. 1090 1091What will *not* be decoded with the qq option: 1092 1093 - everything except instruction pointer associated with PSB packets 1094 1095What *will* be decoded with the qq option: 1096 1097 - instruction pointer associated with PSB packets 1098 1099The Z option is equivalent to having recorded a trace without TSC 1100(i.e. config term tsc=0). It can be useful to avoid timestamp issues when 1101decoding a trace of a virtual machine. 1102 1103 1104dump option 1105~~~~~~~~~~~ 1106 1107perf script has an option (-D) to "dump" the events i.e. display the binary 1108data. 1109 1110When -D is used, Intel PT packets are displayed. The packet decoder does not 1111pay attention to PSB packets, but just decodes the bytes - so the packets seen 1112by the actual decoder may not be identical in places where the data is corrupt. 1113One example of that would be when the buffer-switching interrupt has been too 1114slow, and the buffer has been filled completely. In that case, the last packet 1115in the buffer might be truncated and immediately followed by a PSB as the trace 1116continues in the next buffer. 1117 1118To disable the display of Intel PT packets, combine the -D option with 1119--no-itrace. 1120 1121 1122perf report 1123----------- 1124 1125By default, perf report will decode trace data found in the perf.data file. 1126This can be further controlled by new option --itrace exactly the same as 1127perf script, with the exception that the default is --itrace=igxe. 1128 1129 1130perf inject 1131----------- 1132 1133perf inject also accepts the --itrace option in which case tracing data is 1134removed and replaced with the synthesized events. e.g. 1135 1136 perf inject --itrace -i perf.data -o perf.data.new 1137 1138Below is an example of using Intel PT with autofdo. It requires autofdo 1139(https://github.com/google/autofdo) and gcc version 5. The bubble 1140sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) 1141amended to take the number of elements as a parameter. 1142 1143 $ gcc-5 -O3 sort.c -o sort_optimized 1144 $ ./sort_optimized 30000 1145 Bubble sorting array of 30000 elements 1146 2254 ms 1147 1148 $ cat ~/.perfconfig 1149 [intel-pt] 1150 mispred-all = on 1151 1152 $ perf record -e intel_pt//u ./sort 3000 1153 Bubble sorting array of 3000 elements 1154 58 ms 1155 [ perf record: Woken up 2 times to write data ] 1156 [ perf record: Captured and wrote 3.939 MB perf.data ] 1157 $ perf inject -i perf.data -o inj --itrace=i100usle --strip 1158 $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1 1159 $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo 1160 $ ./sort_autofdo 30000 1161 Bubble sorting array of 30000 elements 1162 2155 ms 1163 1164Note there is currently no advantage to using Intel PT instead of LBR, but 1165that may change in the future if greater use is made of the data. 1166 1167 1168PEBS via Intel PT 1169----------------- 1170 1171Some hardware has the feature to redirect PEBS records to the Intel PT trace. 1172Recording is selected by using the aux-output config term e.g. 1173 1174 perf record -c 10000 -e '{intel_pt/branch=0/,cycles/aux-output/ppp}' uname 1175 1176Note that currently, software only supports redirecting at most one PEBS event. 1177 1178To display PEBS events from the Intel PT trace, use the itrace 'o' option e.g. 1179 1180 perf script --itrace=oe 1181 1182XED 1183--- 1184 1185include::build-xed.txt[] 1186 1187 1188Tracing Virtual Machines 1189------------------------ 1190 1191Currently, only kernel tracing is supported and only with either "timeless" decoding 1192(i.e. no TSC timestamps) or VM Time Correlation. VM Time Correlation is an extra step 1193using 'perf inject' and requires unchanging VMX TSC Offset and no VMX TSC Scaling. 1194 1195Other limitations and caveats 1196 1197 VMX controls may suppress packets needed for decoding resulting in decoding errors 1198 VMX controls may block the perf NMI to the host potentially resulting in lost trace data 1199 Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors 1200 Guest thread information is unknown 1201 Guest VCPU is unknown but may be able to be inferred from the host thread 1202 Callchains are not supported 1203 1204Example using "timeless" decoding 1205 1206Start VM 1207 1208 $ sudo virsh start kubuntu20.04 1209 Domain kubuntu20.04 started 1210 1211Mount the guest file system. Note sshfs needs -o direct_io to enable reading of proc files. root access is needed to read /proc/kcore. 1212 1213 $ mkdir vm0 1214 $ sshfs -o direct_io root@vm0:/ vm0 1215 1216Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore 1217 1218 $ perf buildid-cache -v --kcore vm0/proc/kcore 1219 kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306 1220 $ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms 1221 1222Find the VM process 1223 1224 $ ps -eLl | grep 'KVM\|PID' 1225 F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD 1226 3 S 64055 1430 1 1440 1 80 0 - 1921718 - ? 00:02:47 CPU 0/KVM 1227 3 S 64055 1430 1 1441 1 80 0 - 1921718 - ? 00:02:41 CPU 1/KVM 1228 3 S 64055 1430 1 1442 1 80 0 - 1921718 - ? 00:02:38 CPU 2/KVM 1229 3 S 64055 1430 1 1443 2 80 0 - 1921718 - ? 00:03:18 CPU 3/KVM 1230 1231Start an open-ended perf record, tracing the VM process, do something on the VM, and then ctrl-C to stop. 1232TSC is not supported and tsc=0 must be specified. That means mtc is useless, so add mtc=0. 1233However, IPC can still be determined, hence cyc=1 can be added. 1234Only kernel decoding is supported, so 'k' must be specified. 1235Intel PT traces both the host and the guest so --guest and --host need to be specified. 1236Without timestamps, --per-thread must be specified to distinguish threads. 1237 1238 $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread 1239 ^C 1240 [ perf record: Woken up 1 times to write data ] 1241 [ perf record: Captured and wrote 5.829 MB ] 1242 1243perf script can be used to provide an instruction trace 1244 1245 $ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21 1246 CPU 0/KVM 1440 ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9 1247 CPU 0/KVM 1440 ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10 1248 CPU 0/KVM 1440 ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11 1249 CPU 0/KVM 1440 ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12 1250 CPU 0/KVM 1440 ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13 1251 CPU 0/KVM 1440 ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14 1252 CPU 0/KVM 1440 ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15 1253 CPU 0/KVM 1440 ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax 1254 CPU 0/KVM 1440 ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40 1255 CPU 0/KVM 1440 ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46 1256 CPU 0/KVM 1440 ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.11 (50/445) 1257 :1440 1440 ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms]) nopl %eax, (%rax,%rax,1) 1258 :1440 1440 ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms]) retq IPC: 0.04 (2/41) 1259 :1440 1440 ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms]) data16 nop 1260 :1440 1440 ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms]) xor %eax, %eax 1261 :1440 1440 ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms]) popq %rbp 1262 :1440 1440 ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms]) retq IPC: 0.16 (4/25) 1263 :1440 1440 ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms]) test %eax, %eax 1264 :1440 1440 ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms]) jz 0xffffffffbb74603c IPC: 0.06 (2/30) 1265 :1440 1440 ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms]) popq %rbx 1266 :1440 1440 ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms]) popq %r12 1267 1268Example using VM Time Correlation 1269 1270Start VM 1271 1272 $ sudo virsh start kubuntu20.04 1273 Domain kubuntu20.04 started 1274 1275Mount the guest file system. Note sshfs needs -o direct_io to enable reading of proc files. root access is needed to read /proc/kcore. 1276 1277 $ mkdir -p vm0 1278 $ sshfs -o direct_io root@vm0:/ vm0 1279 1280Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore 1281 1282 $ perf buildid-cache -v --kcore vm0/proc/kcore 1283 same kcore found in /home/user/.debug/[kernel.kcore]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777 1284 $ KALLSYMS=/home/user/.debug/\[kernel.kcore\]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777/kallsyms 1285 1286Find the VM process 1287 1288 $ ps -eLl | grep 'KVM\|PID' 1289 F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD 1290 3 S 64055 16998 1 17005 13 80 0 - 1818189 - ? 00:00:16 CPU 0/KVM 1291 3 S 64055 16998 1 17006 4 80 0 - 1818189 - ? 00:00:05 CPU 1/KVM 1292 3 S 64055 16998 1 17007 3 80 0 - 1818189 - ? 00:00:04 CPU 2/KVM 1293 3 S 64055 16998 1 17008 4 80 0 - 1818189 - ? 00:00:05 CPU 3/KVM 1294 1295Start an open-ended perf record, tracing the VM process, do something on the VM, and then ctrl-C to stop. 1296IPC can be determined, hence cyc=1 can be added. 1297Only kernel decoding is supported, so 'k' must be specified. 1298Intel PT traces both the host and the guest so --guest and --host need to be specified. 1299 1300 $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/cyc=1/k -p 16998 1301 ^C[ perf record: Woken up 1 times to write data ] 1302 [ perf record: Captured and wrote 9.041 MB perf.data.kvm ] 1303 1304Now 'perf inject' can be used to determine the VMX TCS Offset. Note, Intel PT TSC packets are 1305only 7-bytes, so the TSC Offset might differ from the actual value in the 8th byte. That will 1306have no effect i.e. the resulting timestamps will be correct anyway. 1307 1308 $ perf inject -i perf.data.kvm --vm-time-correlation=dry-run 1309 ERROR: Unknown TSC Offset for VMCS 0x1bff6a 1310 VMCS: 0x1bff6a TSC Offset 0xffffe42722c64c41 1311 ERROR: Unknown TSC Offset for VMCS 0x1cbc08 1312 VMCS: 0x1cbc08 TSC Offset 0xffffe42722c64c41 1313 ERROR: Unknown TSC Offset for VMCS 0x1c3ce8 1314 VMCS: 0x1c3ce8 TSC Offset 0xffffe42722c64c41 1315 ERROR: Unknown TSC Offset for VMCS 0x1cbce9 1316 VMCS: 0x1cbce9 TSC Offset 0xffffe42722c64c41 1317 1318Each virtual CPU has a different Virtual Machine Control Structure (VMCS) 1319shown above with the calculated TSC Offset. For an unchanging TSC Offset 1320they should all be the same for the same virtual machine. 1321 1322Now that the TSC Offset is known, it can be provided to 'perf inject' 1323 1324 $ perf inject -i perf.data.kvm --vm-time-correlation="dry-run 0xffffe42722c64c41" 1325 1326Note the options for 'perf inject' --vm-time-correlation are: 1327 1328 [ dry-run ] [ <TSC Offset> [ : <VMCS> [ , <VMCS> ]... ] ]... 1329 1330So it is possible to specify different TSC Offsets for different VMCS. 1331The option "dry-run" will cause the file to be processed but without updating it. 1332Note it is also possible to get a intel_pt.log file by adding option --itrace=d 1333 1334There were no errors so, do it for real 1335 1336 $ perf inject -i perf.data.kvm --vm-time-correlation=0xffffe42722c64c41 --force 1337 1338'perf script' can be used to see if there are any decoder errors 1339 1340 $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --itrace=e-o 1341 1342There were none. 1343 1344'perf script' can be used to provide an instruction trace showing timestamps 1345 1346 $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21 1347 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9 1348 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10 1349 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11 1350 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12 1351 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13 1352 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14 1353 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15 1354 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax 1355 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40 1356 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46 1357 CPU 1/KVM 17006 [001] 11500.262866075: ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.05 (40/769) 1358 :17006 17006 [001] 11500.262869216: ffffffff82200cb0 asm_sysvec_apic_timer_interrupt+0x0 ([guest.kernel.kallsyms]) clac 1359 :17006 17006 [001] 11500.262869216: ffffffff82200cb3 asm_sysvec_apic_timer_interrupt+0x3 ([guest.kernel.kallsyms]) pushq $0xffffffffffffffff 1360 :17006 17006 [001] 11500.262869216: ffffffff82200cb5 asm_sysvec_apic_timer_interrupt+0x5 ([guest.kernel.kallsyms]) callq 0xffffffff82201160 1361 :17006 17006 [001] 11500.262869216: ffffffff82201160 error_entry+0x0 ([guest.kernel.kallsyms]) cld 1362 :17006 17006 [001] 11500.262869216: ffffffff82201161 error_entry+0x1 ([guest.kernel.kallsyms]) pushq %rsi 1363 :17006 17006 [001] 11500.262869216: ffffffff82201162 error_entry+0x2 ([guest.kernel.kallsyms]) movq 0x8(%rsp), %rsi 1364 :17006 17006 [001] 11500.262869216: ffffffff82201167 error_entry+0x7 ([guest.kernel.kallsyms]) movq %rdi, 0x8(%rsp) 1365 :17006 17006 [001] 11500.262869216: ffffffff8220116c error_entry+0xc ([guest.kernel.kallsyms]) pushq %rdx 1366 :17006 17006 [001] 11500.262869216: ffffffff8220116d error_entry+0xd ([guest.kernel.kallsyms]) pushq %rcx 1367 :17006 17006 [001] 11500.262869216: ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms]) pushq %rax 1368 1369 1370Event Trace 1371----------- 1372 1373Event Trace records information about asynchronous events, for example interrupts, 1374faults, VM exits and entries. The information is recorded in CFE and EVD packets, 1375and also the Interrupt Flag is recorded on the MODE.Exec packet. The CFE packet 1376contains a type field to identify one of the following: 1377 1378 1 INTR interrupt, fault, exception, NMI 1379 2 IRET interrupt return 1380 3 SMI system management interrupt 1381 4 RSM resume from system management mode 1382 5 SIPI startup interprocessor interrupt 1383 6 INIT INIT signal 1384 7 VMENTRY VM-Entry 1385 8 VMEXIT VM-Entry 1386 9 VMEXIT_INTR VM-Exit due to interrupt 1387 10 SHUTDOWN Shutdown 1388 1389For more details, refer to the Intel 64 and IA-32 Architectures Software 1390Developer Manuals (version 076 or later). 1391 1392The capability to do Event Trace is indicated by the 1393/sys/bus/event_source/devices/intel_pt/caps/event_trace file. 1394 1395Event trace is selected for recording using the "event" config term. e.g. 1396 1397 perf record -e intel_pt/event/u uname 1398 1399Event trace events are output using the --itrace I option. e.g. 1400 1401 perf script --itrace=Ie 1402 1403perf script displays events containing CFE type, vector and event data, 1404in the form: 1405 1406 evt: hw int (t) cfe: INTR IP: 1 vector: 3 PFA: 0x8877665544332211 1407 1408The IP flag indicates if the event binds to an IP, which includes any case where 1409flow control packet generation is enabled, as well as when CFE packet IP bit is 1410set. 1411 1412perf script displays events containing changes to the Interrupt Flag in the form: 1413 1414 iflag: t IFLAG: 1->0 via branch 1415 1416where "via branch" indicates a branch (interrupt or return from interrupt) and 1417"non branch" indicates an instruction such as CFI, STI or POPF). 1418 1419In addition, the current state of the interrupt flag is indicated by the presence 1420or absence of the "D" (interrupt disabled) perf script flag. If the interrupt 1421flag is changed, then the "t" flag is also included i.e. 1422 1423 no flag, interrupts enabled IF=1 1424 t interrupts become disabled IF=1 -> IF=0 1425 D interrupts are disabled IF=0 1426 Dt interrupts become enabled IF=0 -> IF=1 1427 1428The intel-pt-events.py script illustrates how to access Event Trace information 1429using a Python script. 1430 1431 1432TNT Disable 1433----------- 1434 1435TNT packets are disabled using the "notnt" config term. e.g. 1436 1437 perf record -e intel_pt/notnt/u uname 1438 1439In that case the --itrace q option is forced because walking executable code 1440to reconstruct the control flow is not possible. 1441 1442 1443Emulated PTWRITE 1444---------------- 1445 1446Later perf tools support a method to emulate the ptwrite instruction, which 1447can be useful if hardware does not support the ptwrite instruction. 1448 1449Instead of using the ptwrite instruction, a function is used which produces 1450a trace that encodes the payload data into TNT packets. Here is an example 1451of the function: 1452 1453 #include <stdint.h> 1454 1455 void perf_emulate_ptwrite(uint64_t x) 1456 __attribute__((externally_visible, noipa, no_instrument_function, naked)); 1457 1458 #define PERF_EMULATE_PTWRITE_8_BITS \ 1459 "1: shl %rax\n" \ 1460 " jc 1f\n" \ 1461 "1: shl %rax\n" \ 1462 " jc 1f\n" \ 1463 "1: shl %rax\n" \ 1464 " jc 1f\n" \ 1465 "1: shl %rax\n" \ 1466 " jc 1f\n" \ 1467 "1: shl %rax\n" \ 1468 " jc 1f\n" \ 1469 "1: shl %rax\n" \ 1470 " jc 1f\n" \ 1471 "1: shl %rax\n" \ 1472 " jc 1f\n" \ 1473 "1: shl %rax\n" \ 1474 " jc 1f\n" 1475 1476 /* Undefined instruction */ 1477 #define PERF_EMULATE_PTWRITE_UD2 ".byte 0x0f, 0x0b\n" 1478 1479 #define PERF_EMULATE_PTWRITE_MAGIC PERF_EMULATE_PTWRITE_UD2 ".ascii \"perf,ptwrite \"\n" 1480 1481 void perf_emulate_ptwrite(uint64_t x __attribute__ ((__unused__))) 1482 { 1483 /* Assumes SysV ABI : x passed in rdi */ 1484 __asm__ volatile ( 1485 "jmp 1f\n" 1486 PERF_EMULATE_PTWRITE_MAGIC 1487 "1: mov %rdi, %rax\n" 1488 PERF_EMULATE_PTWRITE_8_BITS 1489 PERF_EMULATE_PTWRITE_8_BITS 1490 PERF_EMULATE_PTWRITE_8_BITS 1491 PERF_EMULATE_PTWRITE_8_BITS 1492 PERF_EMULATE_PTWRITE_8_BITS 1493 PERF_EMULATE_PTWRITE_8_BITS 1494 PERF_EMULATE_PTWRITE_8_BITS 1495 PERF_EMULATE_PTWRITE_8_BITS 1496 "1: ret\n" 1497 ); 1498 } 1499 1500For example, a test program with the function above: 1501 1502 #include <stdio.h> 1503 #include <stdint.h> 1504 #include <stdlib.h> 1505 1506 #include "perf_emulate_ptwrite.h" 1507 1508 int main(int argc, char *argv[]) 1509 { 1510 uint64_t x = 0; 1511 1512 if (argc > 1) 1513 x = strtoull(argv[1], NULL, 0); 1514 perf_emulate_ptwrite(x); 1515 return 0; 1516 } 1517 1518Can be compiled and traced: 1519 1520 $ gcc -Wall -Wextra -O3 -g -o eg_ptw eg_ptw.c 1521 $ perf record -e intel_pt//u ./eg_ptw 0x1234567890abcdef 1522 [ perf record: Woken up 1 times to write data ] 1523 [ perf record: Captured and wrote 0.017 MB perf.data ] 1524 $ perf script --itrace=ew 1525 eg_ptw 19875 [007] 8061.235912: ptwrite: IP: 0 payload: 0x1234567890abcdef 55701249a196 perf_emulate_ptwrite+0x16 (/home/user/eg_ptw) 1526 $ 1527 1528 1529Pipe mode 1530--------- 1531Pipe mode is a problem for Intel PT and possibly other auxtrace users. 1532It's not recommended to use a pipe as data output with Intel PT because 1533of the following reason. 1534 1535Essentially the auxtrace buffers do not behave like the regular perf 1536event buffers. That is because the head and tail are updated by 1537software, but in the auxtrace case the data is written by hardware. 1538So the head and tail do not get updated as data is written. 1539 1540In the Intel PT case, the head and tail are updated only when the trace 1541is disabled by software, for example: 1542 - full-trace, system wide : when buffer passes watermark 1543 - full-trace, not system-wide : when buffer passes watermark or 1544 context switches 1545 - snapshot mode : as above but also when a snapshot is made 1546 - sample mode : as above but also when a sample is made 1547 1548That means finished-round ordering doesn't work. An auxtrace buffer 1549can turn up that has data that extends back in time, possibly to the 1550very beginning of tracing. 1551 1552For a perf.data file, that problem is solved by going through the trace 1553and queuing up the auxtrace buffers in advance. 1554 1555For pipe mode, the order of events and timestamps can presumably 1556be messed up. 1557 1558 1559EXAMPLE 1560------- 1561 1562Examples can be found on perf wiki page "Perf tools support for Intel® Processor Trace": 1563 1564https://perf.wiki.kernel.org/index.php/Perf_tools_support_for_Intel%C2%AE_Processor_Trace 1565 1566 1567SEE ALSO 1568-------- 1569 1570linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1], 1571linkperf:perf-inject[1] 1572