             Central, scheduler-driven, power-performance control
                               (EXPERIMENTAL)

Abstract
========

The topic of a single, simple, power-performance tunable, that is wholly
scheduler centric and has well defined and predictable properties, has come up
on several occasions in the past [1,2]. With techniques such as scheduler
driven DVFS [3], we now have a good framework for implementing such a tunable.
This document describes the overall ideas behind its design and
implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
7. References


1. Motivation
=============

Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. The introduction of sched-DVFS enables running workloads
at the most energy efficient OPPs.

However, it is sometimes desirable to intentionally boost the performance of
a workload, even if that implies a reasonable increase in energy consumption.
For example, in order to reduce the response time of a task, we may want to
run the task at a higher OPP than the one actually required by its CPU
bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the sched-DVFS component is to replace all currently available
CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
driven governors we currently have, it is already more responsive at selecting
the optimal OPP to run tasks allocated to a CPU. However, just tracking the
actual task load demand may not be enough from a performance standpoint. For
example, it is not possible to get behaviors similar to those provided by the
"performance" and "interactive" CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of
sched-DVFS, which extends its functionality to support task performance
boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same
workload for 5[s] every 20[s] while running at a certain OPP, a boosted
execution of that task must complete each of its activations in less than
5[s].

A previous attempt [5] to introduce such a boosting feature has not been
successful, mainly because of the complexity of the proposed solution. The
approach described in this document exposes a single simple interface to
user-space. This single tunable knob allows the tuning of system wide
scheduler behaviours, ranging from energy efficiency at one end through to
incremental performance boosting at the other. This first tunable affects all
tasks. However, a more advanced extension of the concept is also provided,
which uses CGroups to boost the performance of only selected tasks while
using the energy efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface with a single power-performance
tunable:

  /proc/sys/kernel/sched_cfs_boost

This permits expressing a boost value as an integer in the range [0..100].

A value of 0 (the default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.
A value of 100 configures the scheduler for maximum performance, which
translates into the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be used to tune for intermediate scenarios, for
example to favour interactive response, or in reaction to other system events
(e.g. battery level).
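As a usage sketch (assuming a kernel built with this tunable enabled), the
knob can be inspected and set like any other sysctl entry; here we request a
25% system wide boost:

  root@linaro-nano:~# cat /proc/sys/kernel/sched_cfs_boost
  0
  root@linaro-nano:~# echo 25 > /proc/sys/kernel/sched_cfs_boost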
A CGroup based extension is also provided, which permits further user-space
defined task classification, to tune the scheduler for different goals
depending on the specific nature of the task, e.g. background vs interactive
vs low-priority.

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS, by introducing a bias on the
Operating Performance Point (OPP) selection.
Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the global boost
value, or by the boost value of the task's CGroup when in use.

This simple biasing approach leverages existing frameworks, which means
minimal modifications to the scheduler, and yet it allows achieving a range of
different behaviours, all from a single simple tunable knob.
The only new concept introduced is that of signal boosting.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements of tasks and the
capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
inflate some of these load tracking signals, to make a task or RQ appear more
demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently of the specific (signal, consumer) pair, it is important to
define a simple and consistent strategy for the concept of boosting a signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value, to be
added to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting
the one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as the
delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
their maximum possible value, and since sched_cfs_boost is expressed as a
percentage, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal) / 100

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal
  in question by a quantity which is proportional to its headroom, i.e. the
  delta between its current value and the maximum
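Expressed in code, the SPC margin computation can be sketched as follows (a
minimal, stand-alone C sketch; the function name and the plain integer
arithmetic are illustrative, not the exact kernel implementation):

  #include <stdio.h>

  #define SCHED_LOAD_SCALE 1024UL  /* maximum possible signal value */

  /*
   * SPC margin: a fraction (boost / 100) of the headroom left
   * between the current signal value and its maximum.
   */
  static unsigned long spc_margin(unsigned long signal, unsigned int boost)
  {
      unsigned long margin = SCHED_LOAD_SCALE - signal;

      margin *= boost;    /* boost is a percentage in [0..100] */
      margin /= 100;
      return margin;
  }

  int main(void)
  {
      unsigned long signal = 256;

      /* 50% boost: the signal moves half-way towards SCHED_LOAD_SCALE */
      printf("boosted = %lu\n", signal + spc_margin(signal, 50)); /* 640 */
      return 0;
  }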
For example, by applying the SPC boosting strategy to the selection of the
OPP to run a task, it is possible to achieve these behaviors:

- 0% boosting:   run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting:  run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

A graphical representation of an SPC boosted signal is provided in the
following figure, where:
  a) "-" represents the original signal
  b) "b" represents a 50% boosted signal
  c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (titled "original signal") and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new
load signals. Instead, it provides an API to tune existing signals. This
tuning is done on demand and only in the scheduler code paths where it is
sensible to do so. The new API calls are defined to return either the default
signal or a boosted one, depending on the value of sched_cfs_boost. This is a
clean and non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. its CFS run-queue) to appear more utilized than it actually is.

Thus, with sched_cfs_boost enabled, we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the former, but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs
to decide the OPP to run a CPU at. For example, this allows the selection of
the highest OPP for a CPU which has the boost value set to 100%.
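A minimal sketch of how boosted_cpu_util() can be derived from cpu_util()
using the SPC strategy follows (a stand-alone C model; cpu_util() is stubbed
here with a fixed value, and the global variable name is illustrative):

  #include <stdio.h>

  #define SCHED_LOAD_SCALE 1024UL

  static unsigned int sched_cfs_boost = 50;  /* the [0..100] tunable */

  /* Stub standing in for the PELT utilization of a CPU's run-queue. */
  static unsigned long cpu_util(int cpu)
  {
      (void)cpu;
      return 300;
  }

  /*
   * Inflate the CPU utilization by the SPC margin, so that sched-DVFS
   * selects a correspondingly higher OPP.
   */
  static unsigned long boosted_cpu_util(int cpu)
  {
      unsigned long util = cpu_util(cpu);
      unsigned long margin = SCHED_LOAD_SCALE - util;

      margin *= sched_cfs_boost;
      margin /= 100;
      return util + margin;
  }

  int main(void)
  {
      /* 300 + (1024 - 300) * 50 / 100 = 662 */
      printf("boosted utilization: %lu\n", boosted_cpu_util(0));
      return 0;
  }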
5. Per task group boosting
==========================

The availability of a single knob which is used to boost all tasks in the
system is certainly a simple solution, but it quite likely does not fit many
utilization scenarios, especially in the mobile device space.

For example, on battery powered devices there usually are many background
services which are long running and need energy efficient scheduling. On the
other hand, some applications are more performance sensitive and require an
interactive response and/or maximum performance, regardless of the energy
cost. To better service such scenarios, the SchedTune implementation has an
extension that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled to define and
configure task groups with different boost values.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in a group can be specified
using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create 1 level depth hierarchies

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups"
     and they define the boost value for a specific set of tasks.
     Further nested subgroups are not allowed, since they do not have a
     sensible meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more
        than a few boost groups. For example, a reasonable collection of
        groups could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the
        code which has to compute the per CPU boosting once there are
        multiple RUNNABLE tasks with different boost values (see the sketch
        below).

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times
(e.g. Android, ChromeOS) to implement a QoS solution for task boosting based
on task classification, which has been a long standing requirement.
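As an illustration of point 2b above, the per CPU boost can be modeled as the
maximum boost value among the boost groups which currently have RUNNABLE
tasks on that CPU (a simplified, stand-alone C sketch; struct boost_group and
BOOSTGROUPS_COUNT are illustrative names, and the real code also has to deal
with per-CPU data and locking):

  #include <stdio.h>

  #define BOOSTGROUPS_COUNT 16  /* compile-time limit on boost groups */

  /* Per CPU accounting for one boost group. */
  struct boost_group {
      int boost;  /* boost value [0..100] for the tasks of this group */
      int tasks;  /* RUNNABLE tasks of this group on this CPU */
  };

  /*
   * The boost to apply to a CPU is the maximum boost value among the
   * boost groups with RUNNABLE tasks on that CPU; empty groups are
   * ignored, so an idle "performance" group costs nothing.
   */
  static int cpu_boost(const struct boost_group bg[BOOSTGROUPS_COUNT])
  {
      int boost_max = 0;

      for (int idx = 0; idx < BOOSTGROUPS_COUNT; idx++) {
          if (bg[idx].tasks <= 0)
              continue;  /* no RUNNABLE tasks: ignore this group */
          if (bg[idx].boost > boost_max)
              boost_max = bg[idx].boost;
      }
      return boost_max;
  }

  int main(void)
  {
      struct boost_group bg[BOOSTGROUPS_COUNT] = {
          [0] = { .boost =   0, .tasks = 3 },  /* root: default boost */
          [1] = { .boost = 100, .tasks = 1 },  /* "performance" group */
          [2] = { .boost =  10, .tasks = 0 },  /* idle group: skipped */
      };

      printf("CPU boost: %d%%\n", cpu_boost(bg));  /* prints 100% */
      return 0;
  }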
Setup and usage
---------------

0. Use a kernel with CGROUP_SCHEDTUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name	hierarchy	num_cgroups	enabled
   cpuset		0		1		1
   cpu		0		1		1
   schedtune	0		1		1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Setup the system-wide boost value (Optional)

   If not otherwise configured, the root control group has a 0% boost value,
   which effectively disables boosting for all tasks in the system, thus
   running them in an energy-efficient mode.

   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost

5. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to
   boost all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

6. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.


6. Questions and Answers
========================

What about "auto" mode?
-----------------------

The "auto" mode as described in [5] can be implemented by interfacing
SchedTune with some suitable user-space element. This element could use the
exposed system-wide or CGroup based interface.

How are multiple groups of tasks with different boost values managed?
----------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
is boosted with a value which is the maximum of the boost values of the
currently RUNNABLE tasks in its RQ.

This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
to run, and to switch back to the energy efficient mode as soon as the last
boosted task is dequeued.


7. References
=============

[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620