1libtracefs(3) 2============= 3 4NAME 5---- 6tracefs_sql - Create a synthetic event via an SQL statement 7 8SYNOPSIS 9-------- 10[verse] 11-- 12*#include <tracefs.h>* 13 14struct tracefs_synth pass:[*]*tracefs_sql*(struct tep_handle pass:[*]_tep_, const char pass:[*]_name_, 15 const char pass:[*]_sql_buffer_, char pass:[**]_err_); 16-- 17 18DESCRIPTION 19----------- 20Synthetic events are dynamically created events that attach two existing events 21together via one or more matching fields between the two events. It can be used 22to find the latency between the events, or to simply pass fields of the first event 23on to the second event to display as one event. 24 25The Linux kernel interface to create synthetic events is complex, and there needs 26to be a better way to create synthetic events that is easy and can be understood 27via existing technology. 28 29If you think of each event as a table, where the fields are the column of the table 30and each instance of the event as a row, you can understand how SQL can be used 31to attach two events together and form another event (table). Utilizing the 32SQL *SELECT* *FROM* *JOIN* *ON* [ *WHERE* ] syntax, a synthetic event can easily 33be created from two different events. 34 35For simple SQL queries to make a histogram instead of a synthetic event, see 36HISTOGRAMS below. 37 38*tracefs_sql*() takes in a _tep_ handler (See _tep_local_events_(3)) that is used to 39verify the events within the _sql_buffer_ expression. The _name_ is the name of the 40synthetic event to create. If _err_ points to an address of a string, it will be filled 41with a detailed message on any type of parsing error, including fields that do not belong 42to an event, or if the events or fields are not properly compared. 43 44The example program below is a fully functional parser where it will create a synthetic 45event from a SQL syntax passed in via the command line or a file. 46 47The SQL format is as follows: 48 49*SELECT* <fields> *FROM* <start-event> *JOIN* <end-event> *ON* <matching-fields> *WHERE* <filter> 50 51Note, although the examples show the SQL commands in uppercase, they are not required to 52be so. That is, you can use "SELECT" or "select" or "sElEct". 53 54For example: 55[source,c] 56-- 57SELECT syscalls.sys_enter_read.fd, syscalls.sys_exit_read.ret FROM syscalls.sys_enter_read 58 JOIN syscalls.sys_exit_read 59 ON syscalls.sys_enter_read.common_pid = syscalls.sys_exit_write.common_pid 60-- 61 62Will create a synthetic event that with the fields: 63 64 u64 fd; s64 ret; 65 66Because the function takes a _tep_ handle, and usually all event names are unique, you can 67leave off the system (group) name of the event, and *tracefs_sql*() will discover the 68system for you. 69 70That is, the above statement would work with: 71 72[source,c] 73-- 74SELECT sys_enter_read.fd, sys_exit_read.ret FROM sys_enter_read JOIN sys_exit_read 75 ON sys_enter_read.common_pid = sys_exit_write.common_pid 76-- 77 78The *AS* keyword can be used to name the fields as well as to give an alias to the 79events, such that the above can be simplified even more as: 80 81[source,c] 82-- 83SELECT start.fd, end.ret FROM sys_enter_read AS start JOIN sys_exit_read AS end ON start.common_pid = end.common_pid 84-- 85 86The above aliases _sys_enter_read_ as *start* and _sys_exit_read_ as *end* and uses 87those aliases to reference the event throughout the statement. 88 89Using the *AS* keyword in the selection portion of the SQL statement will define what 90those fields will be called in the synthetic event. 91 92[source,c] 93-- 94SELECT start.fd AS filed, end.ret AS return FROM sys_enter_read AS start JOIN sys_exit_read AS end 95 ON start.common_pid = end.common_pid 96-- 97 98The above labels the _fd_ of _start_ as *filed* and the _ret_ of _end_ as *return* where 99the synthetic event that is created will now have the fields: 100 101 u64 filed; s64 return; 102 103The fields can also be calculated with results passed to the synthetic event: 104 105[source,c] 106-- 107select start.truesize, end.len, (start.truesize - end.len) as diff from napi_gro_receive_entry as start 108 JOIN netif_receive_skb as end ON start.skbaddr = end.skbaddr 109-- 110 111Which would show the *truesize* of the _napi_gro_receive_entry_ event, the actual 112_len_ of the content, shown by the _netif_receive_skb_, and the delta between 113the two and expressed by the field *diff*. 114 115The code also supports recording the timestamps at either event, and performing calculations 116on them. For wakeup latency, you have: 117 118[source,c] 119-- 120select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 121 JOIN sched_switch as end ON start.pid = end.next_pid 122-- 123 124The above will create a synthetic event that records the _pid_ of the task being woken up, 125and the time difference between the _sched_waking_ event and the _sched_switch_ event. 126The *TIMESTAMP_USECS* will truncate the time down to microseconds as the timestamp usually 127recorded in the tracing buffer has nanosecond resolution. If you do not want that 128truncation, use *TIMESTAMP* instead of *TIMESTAMP_USECS*. 129 130Because it is so common to have: 131 132[source,c] 133-- 134 (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) 135-- 136 137The above can be represented with *TIMESTAMP_DELTA_USECS* or if nanoseconds are OK, you can 138use *TIMESTAMP_DELTA*. That is, the previous select can also be represented by: 139 140[source,c] 141-- 142select start.pid, TIMESTAMP_DELTA_USECS as lat from sched_waking as start JOIN sched_switch as end ON start.pid = end.next_pid 143-- 144 145 146Finally, the *WHERE* clause can be added, that will let you add filters on either or both events. 147 148[source,c] 149-- 150select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 151 JOIN sched_switch as end ON start.pid = end.next_pid 152 WHERE start.prio < 100 && (!(end.prev_pid < 1 || end.prev_prio > 100) || end.prev_pid == 0) 153-- 154 155*NOTE* 156 157Although both events can be used together in the *WHERE* clause, they must not be mixed outside 158the top most "&&" statements. You can not OR (||) the events together, where a filter of one 159event is OR'd to a filter of the other event. This does not make sense, as the synthetic event 160requires both events to take place to be recorded. If one is filtered out, then the synthetic 161event does not execute. 162 163[source,c] 164-- 165select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 166 JOIN sched_switch as end ON start.pid = end.next_pid 167 WHERE start.prio < 100 && end.prev_prio < 100 168-- 169 170The above is valid. 171 172Where as the below is not. 173 174[source,c] 175-- 176select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 177 JOIN sched_switch as end ON start.pid = end.next_pid 178 WHERE start.prio < 100 || end.prev_prio < 100 179-- 180 181If the kernel supports it, you can pass around a stacktrace between events. 182 183[source, c] 184-- 185select start.prev_pid as pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta, start.STACKTRACE as stack 186 FROM sched_switch as start JOIN sched_switch as end ON start.prev_pid = end.next_pid 187 WHERE start.prev_state == 2 188-- 189 190The above will record a stacktrace when a task is in the UNINTERRUPTIBLE (blocked) state, and trigger 191the synthetic event when it is scheduled back in, recording the time delta that it was blocked for. 192It will record the stacktrace of where it was when it scheduled out along with the delta. 193 194 195KEYWORDS AS EVENT FIELDS 196------------------------ 197 198In some cases, an event may have a keyword. For example, regcache_drop_region has "from" 199as a field and the following will not work 200 201[source,c] 202-- 203 select from from regcache_drop_region 204-- 205 206In such cases, add a backslash to the conflicting field, and this will tell the parser 207that the "from" is a field and not a keyword: 208 209[source,c] 210-- 211 select \from from regcache_drop_region 212-- 213 214HISTOGRAMS 215---------- 216 217Simple SQL statements without the *JOIN* *ON* may also be used, which will create a histogram 218instead. When doing this, the struct tracefs_hist descriptor can be retrieved from the 219returned synthetic event descriptor via the *tracefs_synth_get_start_hist*(3). 220 221In order to utilize the histogram types (see xxx) the CAST command of SQL can be used. 222 223That is: 224 225[source,c] 226-- 227 select CAST(common_pid AS comm), CAST(id AS syscall) FROM sys_enter 228-- 229 230Which produces: 231 232[source,c] 233-- 234 # echo 'hist:keys=common_pid.execname,id.syscall' > events/raw_syscalls/sys_enter/trigger 235 236 # cat events/raw_syscalls/sys_enter/hist 237 238{ common_pid: bash [ 18248], id: sys_setpgid [109] } hitcount: 1 239{ common_pid: sendmail [ 1812], id: sys_read [ 0] } hitcount: 1 240{ common_pid: bash [ 18247], id: sys_getpid [ 39] } hitcount: 1 241{ common_pid: bash [ 18247], id: sys_dup2 [ 33] } hitcount: 1 242{ common_pid: gmain [ 13684], id: sys_inotify_add_watch [254] } hitcount: 1 243{ common_pid: cat [ 18247], id: sys_access [ 21] } hitcount: 1 244{ common_pid: bash [ 18248], id: sys_getpid [ 39] } hitcount: 1 245{ common_pid: cat [ 18247], id: sys_fadvise64 [221] } hitcount: 1 246{ common_pid: sendmail [ 1812], id: sys_openat [257] } hitcount: 1 247{ common_pid: less [ 18248], id: sys_munmap [ 11] } hitcount: 1 248{ common_pid: sendmail [ 1812], id: sys_close [ 3] } hitcount: 1 249{ common_pid: gmain [ 1534], id: sys_poll [ 7] } hitcount: 1 250{ common_pid: bash [ 18247], id: sys_execve [ 59] } hitcount: 1 251-- 252 253Note, string fields may not be cast. 254 255The possible types to cast to are: 256 257*HEX* - convert the value to use hex and not decimal 258 259*SYM* - convert a pointer to symbolic (kallsyms values) 260 261*SYM-OFFSET* - convert a pointer to symbolic and include the offset. 262 263*SYSCALL* - convert the number to the mapped system call name 264 265*EXECNAME* or *COMM* - can only be used with the common_pid field. Will show the task 266name of the process. 267 268*LOG* or *LOG2* - bucket the key values in a log 2 values (1, 2, 3-4, 5-8, 9-16, 17-32, ...) 269 270The above fields are not case sensitive, and "LOG2" works as good as "log". 271 272A special CAST to _COUNTER_ or __COUNTER__ will make the field a value and not 273a key. For example: 274 275[source,c] 276-- 277 SELECT common_pid, CAST(bytes_req AS _COUNTER_) FROM kmalloc 278-- 279 280Which will create 281 282[source,c] 283-- 284 echo 'hist:keys=common_pid:vals=bytes_req' > events/kmem/kmalloc/trigger 285 286 cat events/kmem/kmalloc/hist 287 288{ common_pid: 1812 } hitcount: 1 bytes_req: 32 289{ common_pid: 9111 } hitcount: 2 bytes_req: 272 290{ common_pid: 1768 } hitcount: 3 bytes_req: 1112 291{ common_pid: 0 } hitcount: 4 bytes_req: 512 292{ common_pid: 18297 } hitcount: 11 bytes_req: 2004 293-- 294 295RETURN VALUE 296------------ 297Returns 0 on success and -1 on failure. On failure, if _err_ is defined, it will be 298allocated to hold a detailed description of what went wrong if it the error was caused 299by a parsing error, or that an event, field does not exist or is not compatible with 300what it was combined with. 301 302CREATE A TOOL 303------------- 304 305The below example is a functional program that can be used to parse SQL commands into 306synthetic events. 307 308[source, c] 309-- 310 man tracefs_sql | sed -ne '/^EXAMPLE/,/FILES/ { /EXAMPLE/d ; /FILES/d ; p}' > sqlhist.c 311 gcc -o sqlhist sqlhist.c `pkg-config --cflags --libs libtracefs` 312-- 313 314Then you can run the above examples: 315 316[source, c] 317-- 318 sudo ./sqlhist 'select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 319 JOIN sched_switch as end ON start.pid = end.next_pid 320 WHERE start.prio < 100 || end.prev_prio < 100' 321-- 322EXAMPLE 323------- 324[source,c] 325-- 326#include <stdio.h> 327#include <stdlib.h> 328#include <stdarg.h> 329#include <string.h> 330#include <errno.h> 331#include <unistd.h> 332#include <tracefs.h> 333 334static void usage(char **argv) 335{ 336 fprintf(stderr, "usage: %s [-ed][-n name][-s][-S fields][-m var][-c var][-T][-t dir][-f file | sql-command-line]\n" 337 " -n name - name of synthetic event 'Anonymous' if left off\n" 338 " -t dir - use dir instead of /sys/kernel/tracing\n" 339 " -e - execute the commands to create the synthetic event\n" 340 " -m - trigger the action when var is a new max.\n" 341 " -c - trigger the action when var changes.\n" 342 " -s - used with -m or -c to do a snapshot of the tracing buffer\n" 343 " -S - used with -m or -c to save fields of the end event (comma deliminated)\n" 344 " -T - used with -m or -c to do both a snapshot and a trace\n" 345 " -f file - read sql lines from file otherwise from the command line\n" 346 " if file is '-' then read from standard input.\n", 347 argv[0]); 348 exit(-1); 349} 350 351enum action { 352 ACTION_DEFAULT = 0, 353 ACTION_SNAPSHOT = (1 << 0), 354 ACTION_TRACE = (1 << 1), 355 ACTION_SAVE = (1 << 2), 356 ACTION_MAX = (1 << 3), 357 ACTION_CHANGE = (1 << 4), 358}; 359 360#define ACTIONS ((ACTION_MAX - 1)) 361 362static int do_sql(const char *instance_name, 363 const char *buffer, const char *name, const char *var, 364 const char *trace_dir, bool execute, int action, 365 char **save_fields) 366{ 367 struct tracefs_synth *synth; 368 struct tep_handle *tep; 369 struct trace_seq seq; 370 enum tracefs_synth_handler handler; 371 char *err; 372 int ret; 373 374 if ((action & ACTIONS) && !var) { 375 fprintf(stderr, "Error: -s, -S and -T not supported without -m or -c"); 376 exit(-1); 377 } 378 379 if (!name) 380 name = "Anonymous"; 381 382 trace_seq_init(&seq); 383 tep = tracefs_local_events(trace_dir); 384 if (!tep) { 385 if (!trace_dir) 386 trace_dir = "tracefs directory"; 387 perror(trace_dir); 388 exit(-1); 389 } 390 391 synth = tracefs_sql(tep, name, buffer, &err); 392 if (!synth) { 393 perror("Failed creating synthetic event!"); 394 if (err) 395 fprintf(stderr, "%s", err); 396 free(err); 397 exit(-1); 398 } 399 400 if (tracefs_synth_complete(synth)) { 401 if (var) { 402 if (action & ACTION_MAX) 403 handler = TRACEFS_SYNTH_HANDLE_MAX; 404 else 405 handler = TRACEFS_SYNTH_HANDLE_CHANGE; 406 407 if (action & ACTION_SAVE) { 408 ret = tracefs_synth_save(synth, handler, var, save_fields); 409 if (ret < 0) { 410 err = "adding save"; 411 goto failed_action; 412 } 413 } 414 if (action & ACTION_TRACE) { 415 /* 416 * By doing the trace before snapshot, it will be included 417 * in the snapshot. 418 */ 419 ret = tracefs_synth_trace(synth, handler, var); 420 if (ret < 0) { 421 err = "adding trace"; 422 goto failed_action; 423 } 424 } 425 if (action & ACTION_SNAPSHOT) { 426 ret = tracefs_synth_snapshot(synth, handler, var); 427 if (ret < 0) { 428 err = "adding snapshot"; 429 failed_action: 430 perror(err); 431 if (errno == ENODEV) 432 fprintf(stderr, "ERROR: '%s' is not a variable\n", 433 var); 434 exit(-1); 435 } 436 } 437 } 438 tracefs_synth_echo_cmd(&seq, synth); 439 if (execute) { 440 ret = tracefs_synth_create(synth); 441 if (ret < 0) { 442 fprintf(stderr, "%s\n", tracefs_error_last(NULL)); 443 exit(-1); 444 } 445 } 446 } else { 447 struct tracefs_instance *instance = NULL; 448 struct tracefs_hist *hist; 449 450 hist = tracefs_synth_get_start_hist(synth); 451 if (!hist) { 452 perror("get_start_hist"); 453 exit(-1); 454 } 455 if (instance_name) { 456 if (execute) 457 instance = tracefs_instance_create(instance_name); 458 else 459 instance = tracefs_instance_alloc(trace_dir, 460 instance_name); 461 if (!instance) { 462 perror("Failed to create instance"); 463 exit(-1); 464 } 465 } 466 tracefs_hist_echo_cmd(&seq, instance, hist, 0); 467 if (execute) { 468 ret = tracefs_hist_start(instance, hist); 469 if (ret < 0) { 470 fprintf(stderr, "%s\n", tracefs_error_last(instance)); 471 exit(-1); 472 } 473 } 474 } 475 476 tracefs_synth_free(synth); 477 478 trace_seq_do_printf(&seq); 479 trace_seq_destroy(&seq); 480 return 0; 481} 482 483int main (int argc, char **argv) 484{ 485 char *trace_dir = NULL; 486 char *buffer = NULL; 487 char buf[BUFSIZ]; 488 int buffer_size = 0; 489 const char *file = NULL; 490 const char *instance = NULL; 491 bool execute = false; 492 char **save_fields = NULL; 493 const char *name; 494 const char *var; 495 int action = 0; 496 char *tok; 497 FILE *fp; 498 size_t r; 499 int c; 500 int i; 501 502 for (;;) { 503 c = getopt(argc, argv, "ht:f:en:m:c:sS:TB:"); 504 if (c == -1) 505 break; 506 507 switch(c) { 508 case 'h': 509 usage(argv); 510 case 't': 511 trace_dir = optarg; 512 break; 513 case 'f': 514 file = optarg; 515 break; 516 case 'e': 517 execute = true; 518 break; 519 case 'm': 520 action |= ACTION_MAX; 521 var = optarg; 522 break; 523 case 'c': 524 action |= ACTION_CHANGE; 525 var = optarg; 526 break; 527 case 's': 528 action |= ACTION_SNAPSHOT; 529 break; 530 case 'S': 531 action |= ACTION_SAVE; 532 tok = strtok(optarg, ","); 533 while (tok) { 534 save_fields = tracefs_list_add(save_fields, tok); 535 tok = strtok(NULL, ","); 536 } 537 if (!save_fields) { 538 perror(optarg); 539 exit(-1); 540 } 541 break; 542 case 'T': 543 action |= ACTION_TRACE | ACTION_SNAPSHOT; 544 break; 545 case 'B': 546 instance = optarg; 547 break; 548 case 'n': 549 name = optarg; 550 break; 551 } 552 } 553 554 if ((action & (ACTION_MAX|ACTION_CHANGE)) == (ACTION_MAX|ACTION_CHANGE)) { 555 fprintf(stderr, "Can not use both -m and -c together\n"); 556 exit(-1); 557 } 558 if (file) { 559 if (!strcmp(file, "-")) 560 fp = stdin; 561 else 562 fp = fopen(file, "r"); 563 if (!fp) { 564 perror(file); 565 exit(-1); 566 } 567 while ((r = fread(buf, 1, BUFSIZ, fp)) > 0) { 568 buffer = realloc(buffer, buffer_size + r + 1); 569 strncpy(buffer + buffer_size, buf, r); 570 buffer_size += r; 571 } 572 fclose(fp); 573 if (buffer_size) 574 buffer[buffer_size] = '\0'; 575 } else if (argc == optind) { 576 usage(argv); 577 } else { 578 for (i = optind; i < argc; i++) { 579 r = strlen(argv[i]); 580 buffer = realloc(buffer, buffer_size + r + 2); 581 if (i != optind) 582 buffer[buffer_size++] = ' '; 583 strcpy(buffer + buffer_size, argv[i]); 584 buffer_size += r; 585 } 586 } 587 588 do_sql(instance, buffer, name, var, trace_dir, execute, action, save_fields); 589 free(buffer); 590 591 return 0; 592} 593-- 594 595FILES 596----- 597[verse] 598-- 599*tracefs.h* 600 Header file to include in order to have access to the library APIs. 601*-ltracefs* 602 Linker switch to add when building a program that uses the library. 603-- 604 605SEE ALSO 606-------- 607*sqlhist*(1), 608*libtracefs*(3), 609*libtraceevent*(3), 610*trace-cmd*(1), 611*tracefs_synth_init*(3), 612*tracefs_synth_add_match_field*(3), 613*tracefs_synth_add_compare_field*(3), 614*tracefs_synth_add_start_field*(3), 615*tracefs_synth_add_end_field*(3), 616*tracefs_synth_append_start_filter*(3), 617*tracefs_synth_append_end_filter*(3), 618*tracefs_synth_create*(3), 619*tracefs_synth_destroy*(3), 620*tracefs_synth_free*(3), 621*tracefs_synth_echo_cmd*(3), 622*tracefs_hist_alloc*(3), 623*tracefs_hist_alloc_2d*(3), 624*tracefs_hist_alloc_nd*(3), 625*tracefs_hist_free*(3), 626*tracefs_hist_add_key*(3), 627*tracefs_hist_add_value*(3), 628*tracefs_hist_add_name*(3), 629*tracefs_hist_start*(3), 630*tracefs_hist_destory*(3), 631*tracefs_hist_add_sort_key*(3), 632*tracefs_hist_sort_key_direction*(3) 633 634AUTHOR 635------ 636[verse] 637-- 638*Steven Rostedt* <rostedt@goodmis.org> 639*Tzvetomir Stoyanov* <tz.stoyanov@gmail.com> 640*sameeruddin shaik* <sameeruddin.shaik8@gmail.com> 641-- 642REPORTING BUGS 643-------------- 644Report bugs to <linux-trace-devel@vger.kernel.org> 645 646LICENSE 647------- 648libtracefs is Free Software licensed under the GNU LGPL 2.1 649 650RESOURCES 651--------- 652https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/ 653 654COPYING 655------- 656Copyright \(C) 2020 VMware, Inc. Free use of this software is granted under 657the terms of the GNU Public License (GPL). 658