• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# CPU Scheduling events
2
3On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel
4[ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
5infrastructure.
6
7This allows to get fine grained scheduling events such as:
8
9* Which threads were scheduled on which CPU core at any point in time, with
10  nanosecond accuracy.
11* The reason why a running thread got descheduled (e.g. pre-emption, blocked on
12  a mutex, blocking syscall or any other wait queue).
13* The point in time when a thread became eligible to be executed, even if it was
14  not put immediately on any CPU run queue, together with the source thread that
15  made it executable.
16
17## UI
18
19The UI represents individual scheduling events as slices:
20
21![](/docs/images/cpu-zoomed.png "Detailed view of CPU run queues")
22
23Clicking on a CPU slice shows the relevant information in the details panel:
24
25![](/docs/images/cpu-sched-details.png "CPU scheduling details")
26
27Scrolling down, when expanding individual processes, the scheduling events also
28create one track for each thread, which allows to follow the evolution of the
29state of individual threads:
30
31![](/docs/images/thread-states.png "States of individual threads")
32
33## SQL
34
35At the SQL level, the scheduling data is exposed in the
36[`sched_slice`](/docs/analysis/sql-tables.autogen#sched_slice) table.
37
38```sql
39select ts, dur, cpu, end_state, priority, process.name, thread.name
40from sched_slice left join thread using(utid) left join process using(upid)
41```
42
43ts | dur | cpu | end_state | priority | process.name, | thread.name
44---|-----|-----|-----------|----------|---------------|------------
45261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd
46261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0
47261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2
48261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1
49261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
50261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2
51261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0
52261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
53261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0
54261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
55
56## TraceConfig
57
58To collect this data, include the following data sources:
59
60```protobuf
61# Scheduling data from the kernel.
62data_sources: {
63  config {
64    name: "linux.ftrace"
65    ftrace_config {
66      compact_sched: {
67        enabled: true
68      }
69      ftrace_events: "sched/sched_switch"
70      # optional: precise thread lifetime tracking:
71      ftrace_events: "sched/sched_process_exit"
72      ftrace_events: "sched/sched_process_free"
73      ftrace_events: "task/task_newtask"
74      ftrace_events: "task/task_rename"
75    }
76  }
77}
78
79# Adds full process names and thread<>process relationships:
80data_sources: {
81  config {
82    name: "linux.process_stats"
83  }
84}
85```
86
87## Scheduling wakeups and latency analysis
88
89By further enabling the following in the TraceConfig, the ftrace data source
90will record also scheduling wake up events:
91
92```protobuf
93  ftrace_events: "sched/sched_wakeup_new"
94  ftrace_events: "sched/sched_waking"
95```
96
97While `sched_switch` events are emitted only when a thread is in the
98`R(unnable)` state AND is running on a CPU run queue, `sched_waking` events are
99emitted when any event causes a thread state to change.
100
101Consider the following example:
102
103```
104Thread A
105condition_variable.wait()
106                                     Thread B
107                                     condition_variable.notify()
108```
109
110When Thread A suspends on the wait() it will enter the state `S(sleeping)` and
111get removed from the CPU run queue. When Thread B notifies the variable, the
112kernel will transition Thread A into the `R(unnable)` state. Thread A at that
113point is eligible to be put back on a run queue. However this might not happen
114for some time because, for instance:
115
116* All CPUs might be busy running some other thread, and Thread A needs to wait
117  to get a run queue slot assigned (or the other threads have higher priority).
118* Some other CPUs other than the current one, but the scheduler load balancer
119  might take some time to move the thread on another CPU.
120
121Unless using real-time thread priorities, most Linux Kernel scheduler
122configurations are not strictly work-conserving. For instance the scheduler
123might prefer to wait some time in the hope that the thread running on the
124current CPU goes to idle, avoiding a cross-cpu migration which might be more
125costly both in terms of overhead and power.
126
127NOTE: `sched_waking` and `sched_wakeup` provide nearly the same information. The
128      difference lies in wakeup events across CPUs, which involve
129      inter-processor interrupts. The former is emitted on the source (wakee)
130      CPU, the latter on the destination (waked) CPU. `sched_waking` is usually
131      sufficient for latency analysis, unless you are looking into breaking down
132      latency due to inter-processor signaling.
133
134When enabling `sched_waking` events, the following will appear in the UI when
135selecting a CPU slice:
136
137![](/docs/images/latency.png "Scheduling wake-up events in the UI")
138
139### Decoding `end_state`
140
141The [sched_slice](/docs/analysis/sql-tables.autogen#sched_slice) table contains
142information on scheduling activity of the system:
143
144```
145> select * from sched_slice limit 1
146id  type        ts          dur    cpu utid end_state priority
1470   sched_slice 70730062200 125364 0   1    S         130
148```
149
150Each row of the table shows when a given thread (`utid`) began running
151(`ts`), on which core it ran (`cpu`), for how long it ran (`dur`),
152and why it stopped running: `end_state`.
153
154`end_state` is encoded as one or more ascii characters. The UI uses
155the following translations to convert `end_state` into human readable
156text:
157
158| end_state  | Translation            |
159|------------|------------------------|
160| R          | Runnable               |
161| R+         | Runnable (Preempted)   |
162| S          | Sleeping               |
163| D          | Uninterruptible Sleep  |
164| T          | Stopped                |
165| t          | Traced                 |
166| X          | Exit (Dead)            |
167| Z          | Exit (Zombie)          |
168| x          | Task Dead              |
169| I          | Idle                   |
170| K          | Wake Kill              |
171| W          | Waking                 |
172| P          | Parked                 |
173| N          | No Load                |
174
175Not all combinations of characters are meaningful.
176
177If we do not know when the scheduling ended (for example because the
178trace ended while the thread was still running) `end_state` will be
179`NULL` and `dur` will be -1.
180
181