1# CPU Scheduling events 2 3On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel 4[ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt) 5infrastructure. 6 7This allows to get fine grained scheduling events such as: 8 9* Which threads were scheduling on which CPU cores at any point in time, with 10 nanosecond accuracy. 11* The reason why a running thread got descheduled (e.g. pre-emption, blocked on 12 a mutex, blocking syscall or any other wait queue). 13* The point in time when a thread became eligible to be executed, even if it was 14 not put immediately on any CPU run queue, together with the source thread that 15 made it executable. 16 17## UI 18 19When zoomed out, the UI shows a quantized view of CPU usage, which collapses the 20scheduling information: 21 22 23 24However, by zooming in, the individual scheduling events become visible: 25 26 27 28Clicking on a CPU slice shows the relevant information in the details panel: 29 30 31 32Scrolling down, when expanding individual processes, the scheduling events also 33create one track for each thread, which allows to follow the evolution of the 34state of individual threads: 35 36 37 38 39```protobuf 40data_sources { 41 config { 42 name: "linux.ftrace" 43 ftrace_config { 44 ftrace_events: "sched/sched_switch" 45 ftrace_events: "sched/sched_waking" 46 } 47 } 48} 49``` 50 51## SQL 52 53At the SQL level, the scheduling data is exposed in the 54[`sched_slice`](/docs/analysis/sql-tables.autogen#sched_slice) table. 55 56```sql 57select ts, dur, cpu, end_state, priority, process.name, thread.name 58from sched_slice left join thread using(utid) left join process using(upid) 59``` 60 61ts | dur | cpu | end_state | priority | process.name, | thread.name 62---|-----|-----|-----------|----------|---------------|------------ 63261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd 64261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0 65261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2 66261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1 67261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 68261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2 69261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0 70261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 71261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0 72261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 73 74## TraceConfig 75 76```protobuf 77data_sources: { 78 config { 79 name: "linux.ftrace" 80 ftrace_config { 81 ftrace_events: "sched/sched_switch" 82 ftrace_events: "sched/sched_process_exit" 83 ftrace_events: "sched/sched_process_free" 84 ftrace_events: "task/task_newtask" 85 ftrace_events: "task/task_rename" 86 } 87 } 88} 89 90# This is to get full process name and thread<>process relationships. 91data_sources: { 92 config { 93 name: "linux.process_stats" 94 } 95} 96``` 97 98## Scheduling wakeups and latency analysis 99 100By further enabling the following in the TraceConfig, the ftrace data source 101will record also scheduling wake up events: 102 103```protobuf 104 ftrace_events: "sched/sched_wakeup_new" 105 ftrace_events: "sched/sched_waking" 106``` 107 108While `sched_switch` events are emitted only when a thread is in the 109`R(unnable)` state AND is running on a CPU run queue, `sched_waking` events are 110emitted when any event causes a thread state to change. 111 112Consider the following example: 113 114``` 115Thread A 116condition_variable.wait() 117 Thread B 118 condition_variable.notify() 119``` 120 121When Thread A suspends on the wait() it will enter the state `S(sleeping)` and 122get removed from the CPU run queue. When Thread B notifies the variable, the 123kernel will transition Thread A into the `R(unnable)` state. Thread A at that 124point is eligible to be put back on a run queue. However this might not happen 125for some time because, for instance: 126 127* All CPUs might be busy running some other thread, and Thread A needs to wait 128 to get a run queue slot assigned (or the other threads have higher priority). 129* Some other CPUs other than the current one, but the scheduler load balancer 130 might take some time to move the thread on another CPU. 131 132Unless using real-time thread priorities, most Linux Kernel scheduler 133configurations are not strictly work-conserving. For instance the scheduler 134might prefer to wait some time in the hope that the thread running on the 135current CPU goes to idle, avoiding a cross-cpu migration which might be more 136costly both in terms of overhead and power. 137 138NOTE: `sched_waking` and `sched_wakeup` provide nearly the same information. The 139 difference lies in wakeup events across CPUs, which involve 140 inter-processor interrupts. The former is emitted on the source (wakee) 141 CPU, the latter on the destination (waked) CPU. `sched_waking` is usually 142 sufficient for latency analysis, unless you are looking into breaking down 143 latency due to inter-processor signaling. 144 145When enabling `sched_waking` events, the following will appear in the UI when 146selecting a CPU slice: 147 148 149 150### Decoding `end_state` 151 152The [sched_slice](/docs/analysis/sql-tables.autogen#sched_slice) table contains 153information on scheduling activity of the system: 154 155``` 156> select * from sched_slice limit 1 157id type ts dur cpu utid end_state priority 1580 sched_slice 70730062200 125364 0 1 S 130 159``` 160 161Each row of the table shows when a given thread (`utid`) began running 162(`ts`), on which core it ran (`cpu`), for how long it ran (`dur`), 163and why it stopped running: `end_state`. 164 165`end_state` is encoded as one or more ascii characters. The UI uses 166the following translations to convert `end_state` into human readable 167text: 168 169| end_state | Translation | 170|------------|------------------------| 171| R | Runnable | 172| S | Sleeping | 173| D | Uninterruptible Sleep | 174| T | Stopped | 175| t | Traced | 176| X | Exit (Dead) | 177| Z | Exit (Zombie) | 178| x | Task Dead | 179| I | Task Dead | 180| K | Wake Kill | 181| W | Waking | 182| P | Parked | 183| N | No Load | 184| + | (Preempted) | 185 186Not all combinations of characters are meaningful. 187 188If we do not know when the scheduling ended (for example because the 189trace ended while the thread was still running) `end_state` will be 190`NULL` and `dur` will be -1. 191 192