         Central, scheduler-driven, power-performance control
                            (EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable, that is wholly
scheduler centric, and has well defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as scheduler-driven
DVFS [3], we now have a good framework for implementing such a tunable. This
document describes the overall ideas behind its design and implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
8. References


1. Motivation
=============

Schedutil [3] is a utilization-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU.

However, sometimes it may be desired to intentionally boost the performance of
a workload even if that could imply a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one that is actually required
by its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor component is to replace all
currently available CPUFreq policies. Since schedutil is event-based, as
opposed to the sampling-driven governors we currently have, it is already
more responsive at selecting the optimal OPP to run tasks allocated to a CPU.
However, just tracking the actual task utilization may not be enough from a
performance standpoint. For example, it is not possible to get behaviors
similar to those provided by the "performance" and "interactive" CPUFreq
governors.

This document describes an implementation of a tunable, stacked on top of the
utilization-driven governor, which extends its functionality to support task
performance boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface provided through a new
CGroup controller 'stune' which provides two power-performance tunables
per group:

  /<stune cgroup mount point>/schedtune.prefer_idle
  /<stune cgroup mount point>/schedtune.boost

The CGroup implementation permits arbitrary user-space defined task
classification to tune the scheduler for different goals depending on the
specific nature of the task, e.g. background vs interactive vs low-priority.

More details are given in section 5.

2.1 Boosting
============

The boost value is expressed as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that schedutil runs the tasks at the minimum OPP
required to satisfy their workload demand.

A value of 100 configures the scheduler for maximum performance, which
translates into the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be used to suit other scenarios, for example to
satisfy interactive response requirements or to react to other system events
(battery level, etc.).

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and schedutil by introducing a bias on the OPP
selection.

Each time a task is allocated on a CPU, cpufreq is given the opportunity to
tune the operating frequency of that CPU to better match the workload demand.
The selection of the actual OPP being activated is influenced by the boost
value for the task CGroup.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows a range of different
behaviours to be achieved, all from a single simple tunable knob.

In EAS schedulers, the boosted task and CPU utilization is used for energy
calculation and energy-aware task placement.


2.2 prefer_idle
===============

This is a flag which indicates to the scheduler whether userspace wants it to
focus on energy or on performance.

A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.

A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.

Android platforms typically use this flag for application tasks which the
user is currently interacting with.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements for tasks and the
capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
inflate some of these load tracking signals to make a task or RQ appear more
demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

The boosting strategy currently implemented in SchedTune is called 'Signal
Proportional Compensation' (SPC). With SPC, the sched_cfs_boost value is used
to compute a margin which is proportional to the complement of the original
signal. When a signal has a maximum possible value, its complement is defined
as the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE
as the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal in
  question by a quantity which is proportional to the gap between its current
  value and the maximum value.

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

-   0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
-  50% boosting: run the task at the half-way OPP between minimum and maximum

Which means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

A graphical representation of an SPC boosted signal is given in the following
figure, where:
 a) "-" represents the original signal
 b) "b" represents a 50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_CAPACITY_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                            boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (labelled 'original signal') and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To schedutil, this allows a CPU
(i.e. a CFS run-queue) to appear more utilized than it actually is.

Thus, with sched_cfs_boost enabled we have the following main functions to get
the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where schedutil needs to
decide the OPP to run a CPU at. For example, this allows selecting the highest
OPP for a CPU which has the boost value set to 100%.
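
To make the SPC arithmetic of section 3 and the role of boosted_cpu_util()
concrete, here is a minimal, stand-alone C sketch that applies the margin
formula to a CPU utilization value. It is only an illustration, not the
in-kernel code: the spc_margin() and boosted_util() helpers, the standalone
main() and the explicit division by 100 (since the boost value is a
percentage) are assumptions made for this example; only SCHED_CAPACITY_SCALE,
the [0..100] boost range and the margin formula come from the text above.

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024UL

  /* SPC: the margin is proportional to the complement of the signal. */
  static unsigned long spc_margin(unsigned long signal, unsigned long boost_pct)
  {
      /* A saturated signal cannot be boosted any further. */
      if (signal >= SCHED_CAPACITY_SCALE)
          return 0;

      /* boost_pct is a percentage in [0..100], hence the division by 100. */
      return boost_pct * (SCHED_CAPACITY_SCALE - signal) / 100;
  }

  /* boosted_signal := signal + margin */
  static unsigned long boosted_util(unsigned long util, unsigned long boost_pct)
  {
      return util + spc_margin(util, boost_pct);
  }

  int main(void)
  {
      unsigned long util = 256;  /* 25% of SCHED_CAPACITY_SCALE */

      /* Prints "256 640 1024": unchanged at 0% boost, half-way to the
       * maximum at 50% boost, saturated to the maximum at 100% boost. */
      printf("%lu %lu %lu\n",
             boosted_util(util, 0),
             boosted_util(util, 50),
             boosted_util(util, 100));
      return 0;
  }

When schedutil converts such utilization values into an OPP, the three results
above correspond to the 0%, 50% and 100% behaviours listed in section 3.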


5. Per task group boosting
==========================

On battery powered devices there are usually many background services which
are long running and need energy efficient scheduling. On the other hand, some
applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.

To better service such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled which allows task
groups with different boosting values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be specified
using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create 1 level depth hierarchies

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups" and
     they define the boost value for specific sets of tasks.
     Further nested subgroups are not allowed since they do not have a sensible
     meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more than
        a few boost groups. For example, a reasonable collection of groups
        could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the code
        which has to compute the per CPU boosting once there are multiple
        RUNNABLE tasks with different boost values (a sketch of this
        aggregation is shown below).

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on task
classification, which has been a long standing requirement.
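
As mentioned in point 2b above, and described further in section 7, the boost
applied to a CPU's utilization is the maximum boost value among the boost
groups that currently have RUNNABLE tasks on that CPU. The following
stand-alone C sketch illustrates that aggregation only: the boost_group
structure, the cpu_boost_groups array and the helper names are constructs of
this example, not the actual kernel data structures (BOOSTGROUPS_COUNT matches
the compile-time default of 16 mentioned above).

  #include <stdio.h>

  #define BOOSTGROUPS_COUNT 16  /* compile-time limit, default 16 */

  struct boost_group {
      int boost;  /* schedtune.boost value of this group, in [0..100] */
      int tasks;  /* RUNNABLE tasks of this group on this CPU */
  };

  /* One slot per boost group, tracked for a single CPU. */
  static struct boost_group cpu_boost_groups[BOOSTGROUPS_COUNT];

  /* Boost to apply to this CPU's utilization signal. */
  static int cpu_boost(void)
  {
      int boost_max = 0;
      int idx;

      for (idx = 0; idx < BOOSTGROUPS_COUNT; idx++) {
          /* Groups with no RUNNABLE tasks on this CPU do not contribute. */
          if (cpu_boost_groups[idx].tasks <= 0)
              continue;
          if (cpu_boost_groups[idx].boost > boost_max)
              boost_max = cpu_boost_groups[idx].boost;
      }
      return boost_max;
  }

  int main(void)
  {
      /* A "background" group (boost 0) and a "performance" group (boost 100). */
      cpu_boost_groups[1] = (struct boost_group){ .boost = 0,   .tasks = 3 };
      cpu_boost_groups[2] = (struct boost_group){ .boost = 100, .tasks = 1 };

      /* While the boosted task is RUNNABLE the CPU is boosted to 100%... */
      printf("boost = %d\n", cpu_boost());

      /* ...and falls back to 0% as soon as that task is dequeued. */
      cpu_boost_groups[2].tasks = 0;
      printf("boost = %d\n", cpu_boost());
      return 0;
  }

This max-based aggregation is what allows cpufreq to keep a CPU boosted only
while boosted tasks are ready to run on it, as described in section 7.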

Setup and usage
---------------

0. Use a kernel with CONFIG_SCHED_TUNE support enabled.

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name    hierarchy    num_cgroups    enabled
   cpuset          0            1              1
   cpu             0            1              1
   schedtune       0            1              1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to boost
   all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

5. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.


6. Per-task wakeup-placement-strategy Selection
===============================================

Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.

For touch-driven environments, removing additional wakeup latency can be
critical.

When you use the SchedTune CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy-aware task
placement is bypassed for tasks belonging to that group.

  prefer_idle=0 (default - use energy-aware task placement if available)
  prefer_idle=1 (never use energy-aware task placement for these tasks)

Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency
for the desired tasks whilst still allowing energy-aware wakeup placement
to save energy for other tasks.


7. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
----------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. The CPU utilization seen by schedutil (and used to select an
appropriate OPP) is boosted with a value which is the maximum of the boost
values of the currently RUNNABLE tasks in its RQ.

This allows cpufreq to boost a CPU only while there are boosted tasks ready to
run, and to switch back to the energy efficient mode as soon as the last
boosted task is dequeued.


8. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] https://lkml.org/lkml/2016/3/29/1041