• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1libtracefs(3)
2=============
3
4NAME
5----
6tracefs_sql - Create a synthetic event via an SQL statement
7
8SYNOPSIS
9--------
10[verse]
11--
12*#include <tracefs.h>*
13
14struct tracefs_synth pass:[*]*tracefs_sql*(struct tep_handle pass:[*]_tep_, const char pass:[*]_name_,
15					const char pass:[*]_sql_buffer_, char pass:[**]_err_);
16--
17
18DESCRIPTION
19-----------
20Synthetic events are dynamically created events that attach two existing events
21together via one or more matching fields between the two events. It can be used
22to find the latency between the events, or to simply pass fields of the first event
23on to the second event to display as one event.
24
25The Linux kernel interface to create synthetic events is complex, and there needs
26to be a better way to create synthetic events that is easy and can be understood
27via existing technology.
28
29If you think of each event as a table, where the fields are the column of the table
30and each instance of the event as a row, you can understand how SQL can be used
31to attach two events together and form another event (table). Utilizing the
32SQL *SELECT* *FROM* *JOIN* *ON* [ *WHERE* ] syntax, a synthetic event can easily
33be created from two different events.
34
35For simple SQL queries to make a histogram instead of a synthetic event, see
36HISTOGRAMS below.
37
38*tracefs_sql*() takes in a _tep_ handler (See _tep_local_events_(3)) that is used to
39verify the events within the _sql_buffer_ expression. The _name_ is the name of the
40synthetic event to create. If _err_ points to an address of a string, it will be filled
41with a detailed message on any type of parsing error, including fields that do not belong
42to an event, or if the events or fields are not properly compared.
43
44The example program below is a fully functional parser where it will create a synthetic
45event from a SQL syntax passed in via the command line or a file.
46
47The SQL format is as follows:
48
49*SELECT* <fields> *FROM* <start-event> *JOIN* <end-event> *ON* <matching-fields> *WHERE* <filter>
50
51Note, although the examples show the SQL commands in uppercase, they are not required to
52be so. That is, you can use "SELECT" or "select" or "sElEct".
53
54For example:
55[source,c]
56--
57SELECT syscalls.sys_enter_read.fd, syscalls.sys_exit_read.ret FROM syscalls.sys_enter_read
58   JOIN syscalls.sys_exit_read
59   ON syscalls.sys_enter_read.common_pid = syscalls.sys_exit_write.common_pid
60--
61
62Will create a synthetic event that with the fields:
63
64  u64 fd; s64 ret;
65
66Because the function takes a _tep_ handle, and usually all event names are unique, you can
67leave off the system (group) name of the event, and *tracefs_sql*() will discover the
68system for you.
69
70That is, the above statement would work with:
71
72[source,c]
73--
74SELECT sys_enter_read.fd, sys_exit_read.ret FROM sys_enter_read JOIN sys_exit_read
75   ON sys_enter_read.common_pid = sys_exit_write.common_pid
76--
77
78The *AS* keyword can be used to name the fields as well as to give an alias to the
79events, such that the above can be simplified even more as:
80
81[source,c]
82--
83SELECT start.fd, end.ret FROM sys_enter_read AS start JOIN sys_exit_read AS end ON start.common_pid = end.common_pid
84--
85
86The above aliases _sys_enter_read_ as *start* and _sys_exit_read_ as *end* and uses
87those aliases to reference the event throughout the statement.
88
89Using the *AS* keyword in the selection portion of the SQL statement will define what
90those fields will be called in the synthetic event.
91
92[source,c]
93--
94SELECT start.fd AS filed, end.ret AS return FROM sys_enter_read AS start JOIN sys_exit_read AS end
95   ON start.common_pid = end.common_pid
96--
97
98The above labels the _fd_ of _start_ as *filed* and the _ret_ of _end_ as *return* where
99the synthetic event that is created will now have the fields:
100
101  u64 filed; s64 return;
102
103The fields can also be calculated with results passed to the synthetic event:
104
105[source,c]
106--
107select start.truesize, end.len, (start.truesize - end.len) as diff from napi_gro_receive_entry as start
108   JOIN netif_receive_skb as end ON start.skbaddr = end.skbaddr
109--
110
111Which would show the *truesize* of the _napi_gro_receive_entry_ event, the actual
112_len_ of the content, shown by the _netif_receive_skb_, and the delta between
113the two and expressed by the field *diff*.
114
115The code also supports recording the timestamps at either event, and performing calculations
116on them. For wakeup latency, you have:
117
118[source,c]
119--
120select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
121   JOIN sched_switch as end ON start.pid = end.next_pid
122--
123
124The above will create a synthetic event that records the _pid_ of the task being woken up,
125and the time difference between the _sched_waking_ event and the _sched_switch_ event.
126The *TIMESTAMP_USECS* will truncate the time down to microseconds as the timestamp usually
127recorded in the tracing buffer has nanosecond resolution. If you do not want that
128truncation, use *TIMESTAMP* instead of *TIMESTAMP_USECS*.
129
130Because it is so common to have:
131
132[source,c]
133--
134   (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS)
135--
136
137The above can be represented with *TIMESTAMP_DELTA_USECS* or if nanoseconds are OK, you can
138use *TIMESTAMP_DELTA*. That is, the previous select can also be represented by:
139
140[source,c]
141--
142select start.pid, TIMESTAMP_DELTA_USECS as lat from sched_waking as start JOIN sched_switch as end ON start.pid = end.next_pid
143--
144
145
146Finally, the *WHERE* clause can be added, that will let you add filters on either or both events.
147
148[source,c]
149--
150select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
151   JOIN sched_switch as end ON start.pid = end.next_pid
152   WHERE start.prio < 100 && (!(end.prev_pid < 1 || end.prev_prio > 100) || end.prev_pid == 0)
153--
154
155*NOTE*
156
157Although both events can be used together in the *WHERE* clause, they must not be mixed outside
158the top most "&&" statements. You can not OR (||) the events together, where a filter of one
159event is OR'd to a filter of the other event. This does not make sense, as the synthetic event
160requires both events to take place to be recorded. If one is filtered out, then the synthetic
161event does not execute.
162
163[source,c]
164--
165select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
166   JOIN sched_switch as end ON start.pid = end.next_pid
167   WHERE start.prio < 100 && end.prev_prio < 100
168--
169
170The above is valid.
171
172Where as the below is not.
173
174[source,c]
175--
176select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
177   JOIN sched_switch as end ON start.pid = end.next_pid
178   WHERE start.prio < 100 || end.prev_prio < 100
179--
180
181If the kernel supports it, you can pass around a stacktrace between events.
182
183[source, c]
184--
185select start.prev_pid as pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta, start.STACKTRACE as stack
186   FROM sched_switch as start JOIN sched_switch as end ON start.prev_pid = end.next_pid
187   WHERE start.prev_state == 2
188--
189
190The above will record a stacktrace when a task is in the UNINTERRUPTIBLE (blocked) state, and trigger
191the synthetic event when it is scheduled back in, recording the time delta that it was blocked for.
192It will record the stacktrace of where it was when it scheduled out along with the delta.
193
194
195KEYWORDS AS EVENT FIELDS
196------------------------
197
198In some cases, an event may have a keyword. For example, regcache_drop_region has "from"
199as a field and the following will not work
200
201[source,c]
202--
203  select from from regcache_drop_region
204--
205
206In such cases, add a backslash to the conflicting field, and this will tell the parser
207that the "from" is a field and not a keyword:
208
209[source,c]
210--
211  select \from from regcache_drop_region
212--
213
214HISTOGRAMS
215----------
216
217Simple SQL statements without the *JOIN* *ON* may also be used, which will create a histogram
218instead. When doing this, the struct tracefs_hist descriptor can be retrieved from the
219returned synthetic event descriptor via the *tracefs_synth_get_start_hist*(3).
220
221In order to utilize the histogram types (see xxx) the CAST command of SQL can be used.
222
223That is:
224
225[source,c]
226--
227  select CAST(common_pid AS comm), CAST(id AS syscall) FROM sys_enter
228--
229
230Which produces:
231
232[source,c]
233--
234 # echo 'hist:keys=common_pid.execname,id.syscall' > events/raw_syscalls/sys_enter/trigger
235
236 # cat events/raw_syscalls/sys_enter/hist
237
238{ common_pid: bash            [     18248], id: sys_setpgid                   [109] } hitcount:          1
239{ common_pid: sendmail        [      1812], id: sys_read                      [  0] } hitcount:          1
240{ common_pid: bash            [     18247], id: sys_getpid                    [ 39] } hitcount:          1
241{ common_pid: bash            [     18247], id: sys_dup2                      [ 33] } hitcount:          1
242{ common_pid: gmain           [     13684], id: sys_inotify_add_watch         [254] } hitcount:          1
243{ common_pid: cat             [     18247], id: sys_access                    [ 21] } hitcount:          1
244{ common_pid: bash            [     18248], id: sys_getpid                    [ 39] } hitcount:          1
245{ common_pid: cat             [     18247], id: sys_fadvise64                 [221] } hitcount:          1
246{ common_pid: sendmail        [      1812], id: sys_openat                    [257] } hitcount:          1
247{ common_pid: less            [     18248], id: sys_munmap                    [ 11] } hitcount:          1
248{ common_pid: sendmail        [      1812], id: sys_close                     [  3] } hitcount:          1
249{ common_pid: gmain           [      1534], id: sys_poll                      [  7] } hitcount:          1
250{ common_pid: bash            [     18247], id: sys_execve                    [ 59] } hitcount:          1
251--
252
253Note, string fields may not be cast.
254
255The possible types to cast to are:
256
257*HEX* - convert the value to use hex and not decimal
258
259*SYM* - convert a pointer to symbolic (kallsyms values)
260
261*SYM-OFFSET* - convert a pointer to symbolic and include the offset.
262
263*SYSCALL* - convert the number to the mapped system call name
264
265*EXECNAME* or *COMM* - can only be used with the common_pid field. Will show the task
266name of the process.
267
268*LOG* or *LOG2* - bucket the key values in a log 2 values (1, 2, 3-4, 5-8, 9-16, 17-32, ...)
269
270The above fields are not case sensitive, and "LOG2" works as good as "log".
271
272A special CAST to _COUNTER_ or __COUNTER__ will make the field a value and not
273a key. For example:
274
275[source,c]
276--
277  SELECT common_pid, CAST(bytes_req AS _COUNTER_) FROM kmalloc
278--
279
280Which will create
281
282[source,c]
283--
284  echo 'hist:keys=common_pid:vals=bytes_req' > events/kmem/kmalloc/trigger
285
286  cat events/kmem/kmalloc/hist
287
288{ common_pid:       1812 } hitcount:          1  bytes_req:         32
289{ common_pid:       9111 } hitcount:          2  bytes_req:        272
290{ common_pid:       1768 } hitcount:          3  bytes_req:       1112
291{ common_pid:          0 } hitcount:          4  bytes_req:        512
292{ common_pid:      18297 } hitcount:         11  bytes_req:       2004
293--
294
295RETURN VALUE
296------------
297Returns 0 on success and -1 on failure. On failure, if _err_ is defined, it will be
298allocated to hold a detailed description of what went wrong if it the error was caused
299by a parsing error, or that an event, field does not exist or is not compatible with
300what it was combined with.
301
302CREATE A TOOL
303-------------
304
305The below example is a functional program that can be used to parse SQL commands into
306synthetic events.
307
308[source, c]
309--
310   man tracefs_sql | sed -ne '/^EXAMPLE/,/FILES/ { /EXAMPLE/d ; /FILES/d ; p}' > sqlhist.c
311   gcc -o sqlhist sqlhist.c `pkg-config --cflags --libs libtracefs`
312--
313
314Then you can run the above examples:
315
316[source, c]
317--
318  sudo ./sqlhist 'select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
319                  JOIN sched_switch as end ON start.pid = end.next_pid
320                  WHERE start.prio < 100 || end.prev_prio < 100'
321--
322EXAMPLE
323-------
324[source,c]
325--
326#include <stdio.h>
327#include <stdlib.h>
328#include <stdarg.h>
329#include <string.h>
330#include <errno.h>
331#include <unistd.h>
332#include <tracefs.h>
333
334static void usage(char **argv)
335{
336	fprintf(stderr, "usage: %s [-ed][-n name][-s][-S fields][-m var][-c var][-T][-t dir][-f file | sql-command-line]\n"
337		"  -n name - name of synthetic event 'Anonymous' if left off\n"
338		"  -t dir - use dir instead of /sys/kernel/tracing\n"
339		"  -e - execute the commands to create the synthetic event\n"
340		"  -m - trigger the action when var is a new max.\n"
341		"  -c - trigger the action when var changes.\n"
342		"  -s - used with -m or -c to do a snapshot of the tracing buffer\n"
343		"  -S - used with -m or -c to save fields of the end event (comma deliminated)\n"
344		"  -T - used with -m or -c to do both a snapshot and a trace\n"
345		"  -f file - read sql lines from file otherwise from the command line\n"
346		"            if file is '-' then read from standard input.\n",
347		argv[0]);
348	exit(-1);
349}
350
351enum action {
352	ACTION_DEFAULT		= 0,
353	ACTION_SNAPSHOT		= (1 << 0),
354	ACTION_TRACE		= (1 << 1),
355	ACTION_SAVE		= (1 << 2),
356	ACTION_MAX		= (1 << 3),
357	ACTION_CHANGE		= (1 << 4),
358};
359
360#define ACTIONS ((ACTION_MAX - 1))
361
362static int do_sql(const char *instance_name,
363		  const char *buffer, const char *name, const char *var,
364		  const char *trace_dir, bool execute, int action,
365		  char **save_fields)
366{
367	struct tracefs_synth *synth;
368	struct tep_handle *tep;
369	struct trace_seq seq;
370	enum tracefs_synth_handler handler;
371	char *err;
372	int ret;
373
374	if ((action & ACTIONS) && !var) {
375		fprintf(stderr, "Error: -s, -S and -T not supported without -m or -c");
376		exit(-1);
377	}
378
379	if (!name)
380		name = "Anonymous";
381
382	trace_seq_init(&seq);
383	tep = tracefs_local_events(trace_dir);
384	if (!tep) {
385		if (!trace_dir)
386			trace_dir = "tracefs directory";
387		perror(trace_dir);
388		exit(-1);
389	}
390
391	synth = tracefs_sql(tep, name, buffer, &err);
392	if (!synth) {
393		perror("Failed creating synthetic event!");
394		if (err)
395			fprintf(stderr, "%s", err);
396		free(err);
397		exit(-1);
398	}
399
400	if (tracefs_synth_complete(synth)) {
401		if (var) {
402			if (action & ACTION_MAX)
403				handler = TRACEFS_SYNTH_HANDLE_MAX;
404			else
405				handler = TRACEFS_SYNTH_HANDLE_CHANGE;
406
407			if (action & ACTION_SAVE) {
408				ret = tracefs_synth_save(synth, handler, var, save_fields);
409				if (ret < 0) {
410					err = "adding save";
411					goto failed_action;
412				}
413			}
414			if (action & ACTION_TRACE) {
415				/*
416				 * By doing the trace before snapshot, it will be included
417				 * in the snapshot.
418				 */
419				ret = tracefs_synth_trace(synth, handler, var);
420				if (ret < 0) {
421					err = "adding trace";
422					goto failed_action;
423				}
424			}
425			if (action & ACTION_SNAPSHOT) {
426				ret = tracefs_synth_snapshot(synth, handler, var);
427				if (ret < 0) {
428					err = "adding snapshot";
429 failed_action:
430					perror(err);
431					if (errno == ENODEV)
432						fprintf(stderr, "ERROR: '%s' is not a variable\n",
433							var);
434					exit(-1);
435				}
436			}
437		}
438		tracefs_synth_echo_cmd(&seq, synth);
439		if (execute) {
440			ret = tracefs_synth_create(synth);
441			if (ret < 0) {
442				fprintf(stderr, "%s\n", tracefs_error_last(NULL));
443				exit(-1);
444			}
445		}
446	} else {
447		struct tracefs_instance *instance = NULL;
448		struct tracefs_hist *hist;
449
450		hist = tracefs_synth_get_start_hist(synth);
451		if (!hist) {
452			perror("get_start_hist");
453			exit(-1);
454		}
455		if (instance_name) {
456			if (execute)
457				instance = tracefs_instance_create(instance_name);
458			else
459				instance = tracefs_instance_alloc(trace_dir,
460								  instance_name);
461			if (!instance) {
462				perror("Failed to create instance");
463				exit(-1);
464			}
465		}
466		tracefs_hist_echo_cmd(&seq, instance, hist, 0);
467		if (execute) {
468			ret = tracefs_hist_start(instance, hist);
469			if (ret < 0) {
470				fprintf(stderr, "%s\n", tracefs_error_last(instance));
471				exit(-1);
472			}
473		}
474	}
475
476	tracefs_synth_free(synth);
477
478	trace_seq_do_printf(&seq);
479	trace_seq_destroy(&seq);
480	return 0;
481}
482
483int main (int argc, char **argv)
484{
485	char *trace_dir = NULL;
486	char *buffer = NULL;
487	char buf[BUFSIZ];
488	int buffer_size = 0;
489	const char *file = NULL;
490	const char *instance = NULL;
491	bool execute = false;
492	char **save_fields = NULL;
493	const char *name;
494	const char *var;
495	int action = 0;
496	char *tok;
497	FILE *fp;
498	size_t r;
499	int c;
500	int i;
501
502	for (;;) {
503		c = getopt(argc, argv, "ht:f:en:m:c:sS:TB:");
504		if (c == -1)
505			break;
506
507		switch(c) {
508		case 'h':
509			usage(argv);
510		case 't':
511			trace_dir = optarg;
512			break;
513		case 'f':
514			file = optarg;
515			break;
516		case 'e':
517			execute = true;
518			break;
519		case 'm':
520			action |= ACTION_MAX;
521			var = optarg;
522			break;
523		case 'c':
524			action |= ACTION_CHANGE;
525			var = optarg;
526			break;
527		case 's':
528			action |= ACTION_SNAPSHOT;
529			break;
530		case 'S':
531			action |= ACTION_SAVE;
532			tok = strtok(optarg, ",");
533			while (tok) {
534				save_fields = tracefs_list_add(save_fields, tok);
535				tok = strtok(NULL, ",");
536			}
537			if (!save_fields) {
538				perror(optarg);
539				exit(-1);
540			}
541			break;
542		case 'T':
543			action |= ACTION_TRACE | ACTION_SNAPSHOT;
544			break;
545		case 'B':
546			instance = optarg;
547			break;
548		case 'n':
549			name = optarg;
550			break;
551		}
552	}
553
554	if ((action & (ACTION_MAX|ACTION_CHANGE)) == (ACTION_MAX|ACTION_CHANGE)) {
555		fprintf(stderr, "Can not use both -m and -c together\n");
556		exit(-1);
557	}
558	if (file) {
559		if (!strcmp(file, "-"))
560			fp = stdin;
561		else
562			fp = fopen(file, "r");
563		if (!fp) {
564			perror(file);
565			exit(-1);
566		}
567		while ((r = fread(buf, 1, BUFSIZ, fp)) > 0) {
568			buffer = realloc(buffer, buffer_size + r + 1);
569			strncpy(buffer + buffer_size, buf, r);
570			buffer_size += r;
571		}
572		fclose(fp);
573		if (buffer_size)
574			buffer[buffer_size] = '\0';
575	} else if (argc == optind) {
576		usage(argv);
577	} else {
578		for (i = optind; i < argc; i++) {
579			r = strlen(argv[i]);
580			buffer = realloc(buffer, buffer_size + r + 2);
581			if (i != optind)
582				buffer[buffer_size++] = ' ';
583			strcpy(buffer + buffer_size, argv[i]);
584			buffer_size += r;
585		}
586	}
587
588	do_sql(instance, buffer, name, var, trace_dir, execute, action, save_fields);
589	free(buffer);
590
591	return 0;
592}
593--
594
595FILES
596-----
597[verse]
598--
599*tracefs.h*
600	Header file to include in order to have access to the library APIs.
601*-ltracefs*
602	Linker switch to add when building a program that uses the library.
603--
604
605SEE ALSO
606--------
607*sqlhist*(1),
608*libtracefs*(3),
609*libtraceevent*(3),
610*trace-cmd*(1),
611*tracefs_synth_init*(3),
612*tracefs_synth_add_match_field*(3),
613*tracefs_synth_add_compare_field*(3),
614*tracefs_synth_add_start_field*(3),
615*tracefs_synth_add_end_field*(3),
616*tracefs_synth_append_start_filter*(3),
617*tracefs_synth_append_end_filter*(3),
618*tracefs_synth_create*(3),
619*tracefs_synth_destroy*(3),
620*tracefs_synth_free*(3),
621*tracefs_synth_echo_cmd*(3),
622*tracefs_hist_alloc*(3),
623*tracefs_hist_alloc_2d*(3),
624*tracefs_hist_alloc_nd*(3),
625*tracefs_hist_free*(3),
626*tracefs_hist_add_key*(3),
627*tracefs_hist_add_value*(3),
628*tracefs_hist_add_name*(3),
629*tracefs_hist_start*(3),
630*tracefs_hist_destory*(3),
631*tracefs_hist_add_sort_key*(3),
632*tracefs_hist_sort_key_direction*(3)
633
634AUTHOR
635------
636[verse]
637--
638*Steven Rostedt* <rostedt@goodmis.org>
639*Tzvetomir Stoyanov* <tz.stoyanov@gmail.com>
640*sameeruddin shaik* <sameeruddin.shaik8@gmail.com>
641--
642REPORTING BUGS
643--------------
644Report bugs to  <linux-trace-devel@vger.kernel.org>
645
646LICENSE
647-------
648libtracefs is Free Software licensed under the GNU LGPL 2.1
649
650RESOURCES
651---------
652https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
653
654COPYING
655-------
656Copyright \(C) 2020 VMware, Inc. Free use of this software is granted under
657the terms of the GNU Public License (GPL).
658