             Central, scheduler-driven, power-performance control
                               (EXPERIMENTAL)

Abstract
========

The topic of a single, simple power-performance tunable that is wholly
scheduler-centric and has well-defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as scheduler-driven
DVFS [3], we now have a good framework for implementing such a tunable.
This document describes the overall ideas behind its design and implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
8. References


1. Motivation
=============

Schedutil [3] is a utilization-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU.

However, sometimes it may be desirable to intentionally boost the performance
of a workload even if that could imply a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one that is actually required
by its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor component is to replace all
currently available CPUFreq policies. Since schedutil is event-based, as
opposed to the sampling-driven governors we currently have, it is already
more responsive at selecting the optimal OPP to run tasks allocated to a CPU.
However, just tracking the actual task utilization may not be enough from a
performance standpoint. For example, it is not possible to get behaviors
similar to those provided by the "performance" and "interactive" CPUFreq
governors.

This document describes an implementation of a tunable, stacked on top of the
utilization-driven governor, which extends its functionality to support task
performance boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface provided through a new
CGroup controller, 'schedtune', which provides two power-performance tunables
per group:

  /<stune cgroup mount point>/schedtune.prefer_idle
  /<stune cgroup mount point>/schedtune.boost

The CGroup implementation permits arbitrary user-space defined task
classification to tune the scheduler for different goals depending on the
specific nature of the task, e.g. background vs interactive vs low-priority.

More details are given in section 5.

2.1 Boosting
============

The boost value is expressed as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that schedutil runs the tasks at the minimum OPP
required to satisfy their workload demand.

A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be used to suit other scenarios, for example to
improve interactive response, or can be adjusted in reaction to other system
events (battery level etc).

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and schedutil by introducing a bias on the OPP
selection.

Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the boost value
for the task CGroup.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it makes it possible to achieve a range
of different behaviours all from a single simple tunable knob.

In EAS schedulers, we use boosted task and CPU utilization for energy
calculation and energy-aware task placement.

2.2 prefer_idle
===============

This is a flag which indicates to the scheduler whether userspace would like
the scheduler to focus on energy or on performance when placing the tasks of
a group.

A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.

A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.

Android platforms typically use this flag for application tasks which the
user is currently interacting with.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking signals
which basically track the CPU bandwidth requirements for tasks and the capacity
of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
some of these load tracking signals to make a task or RQ appear more demanding
than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

The boosting strategy currently implemented in SchedTune is called 'Signal
Proportional Compensation' (SPC). With SPC, the sched_cfs_boost value is used to
compute a margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE as
the maximum possible value, the margin becomes:

	margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)

where sched_cfs_boost is taken as a fraction of 100%, e.g. 0.5 for a 50% boost.

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal in
  question by a quantity which is proportional to the gap between the signal
  and the maximum value.

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

-   0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
-  50% boosting: run at the half-way OPP between minimum and maximum

Which means that, at 50% boosting, a task will be scheduled to run half-way
between the performance required by its workload and the maximum theoretically
achievable performance on the specific target platform.
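
The SPC arithmetic above can be sketched in a few lines of user-space Python.
This is only an illustrative model, not the kernel code: the function names
spc_margin() and boosted_signal() are made up for this example, the boost is
passed as an integer percentage and divided by 100, and SCHED_CAPACITY_SCALE
is assumed to be 1024.

```python
# Illustrative model of Signal Proportional Compensation (SPC).
# Assumption: signals are integers in [0..SCHED_CAPACITY_SCALE] and the
# scale is 1024. The function names are hypothetical.
SCHED_CAPACITY_SCALE = 1024


def spc_margin(boost_pct, signal):
    """Return the margin: boost applied to the complement of the signal."""
    if not 0 <= boost_pct <= 100:
        raise ValueError("boost must be in the range [0..100]")
    if not 0 <= signal <= SCHED_CAPACITY_SCALE:
        raise ValueError("signal must be in [0..SCHED_CAPACITY_SCALE]")
    # The percentage is turned into a fraction by the integer division.
    return boost_pct * (SCHED_CAPACITY_SCALE - signal) // 100


def boosted_signal(boost_pct, signal):
    """Return signal + margin, as in the definition given in the text."""
    return signal + spc_margin(boost_pct, signal)


for boost in (0, 50, 100):
    # Prints 256, 640 and 1024 for a signal of 256.
    print(boost, boosted_signal(boost, 256))
```

With a signal of 256, a 50% boost adds half of the remaining 768 units, which
gives 640, while a 100% boost always saturates the signal at 1024.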

A graphical representation of an SPC boosted signal is shown in the
following figure where:
 a) "-" represents the original signal
 b) "b" represents a  50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_CAPACITY_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (labelled 'original signal') and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To schedutil, this allows a CPU
(i.e. its CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to cpu_util() but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where schedutil needs to
decide the OPP to run a CPU at. For example, this allows selecting the highest
OPP for a CPU which has the boost value set to 100%.
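
How a boosted utilization can move a CPU to a higher OPP may be easier to see
with a small user-space model. The sketch below is not the kernel
implementation: the Python cpu_util() and boosted_cpu_util() only borrow the
names used above and take plain integers, select_opp() and the OPP table are
hypothetical, SCHED_CAPACITY_SCALE is assumed to be 1024, and the frequency
mapping is a simplified proportional one that ignores any extra headroom a
real governor may apply.

```python
# Simplified model: utilization -> boosted utilization -> OPP.
# Assumptions: capacity scale of 1024, a made-up OPP table in kHz, and a
# purely proportional utilization-to-frequency mapping.
SCHED_CAPACITY_SCALE = 1024
OPP_FREQS_KHZ = [500000, 1000000, 1500000, 2000000]  # hypothetical table


def cpu_util(rq_util):
    """Plain CPU utilization, clamped to the capacity scale."""
    return min(rq_util, SCHED_CAPACITY_SCALE)


def boosted_cpu_util(rq_util, boost_pct):
    """CPU utilization inflated with the SPC margin for boost_pct."""
    util = cpu_util(rq_util)
    margin = boost_pct * (SCHED_CAPACITY_SCALE - util) // 100
    return util + margin


def select_opp(util, opp_freqs_khz):
    """Pick the lowest OPP whose share of the maximum covers util."""
    max_freq = max(opp_freqs_khz)
    for freq in sorted(opp_freqs_khz):
        # freq / max_freq >= util / SCHED_CAPACITY_SCALE, in integer math.
        if freq * SCHED_CAPACITY_SCALE >= util * max_freq:
            return freq
    return max_freq


for boost in (0, 50, 100):
    util = boosted_cpu_util(256, boost)
    print(boost, util, select_opp(util, OPP_FREQS_KHZ))
```

In this model a CPU with a utilization of 256 runs at the lowest OPP when not
boosted, while a 100% boost makes it appear fully utilized and therefore
selects the highest OPP of the table.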


5. Per task group boosting
==========================

On battery powered devices there usually are many background services which are
long running and need energy efficient scheduling. On the other hand, some
applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.

To better service such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled which allows task
groups with different boosting values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be specified
using a single knob exposed by the CGroup controller:

   schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create hierarchies that are one level deep

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups" and
     they define the boost value for a specific set of tasks.
     Further nested subgroups are not allowed since they do not have a sensible
     meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more than
        a few boost groups. For example, a reasonable collection of groups could
        be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the code
        which has to compute the per CPU boosting once there are multiple
        RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on task
classification, which has been a long standing requirement.

Setup and usage
---------------

0. Use a kernel with CONFIG_SCHED_TUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name	hierarchy	num_cgroups	enabled
   cpuset  	0		1		1
   cpu     	0		1		1
   schedtune	0		1		1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Create task groups and configure their specific boost value (Optional)

   For example here we create a "performance" boost group configured to boost
   all its tasks to 100%

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

5. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group.

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP in the most capable CPU of the system.
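
A user-space run-time could drive steps 4 and 5 above programmatically. The
following Python sketch is one possible way to do so and is not part of
SchedTune: the helper names create_boost_group() and move_task() are
hypothetical, and the mount point is passed in as a parameter on the
assumption that the controller is mounted as shown in step 3.

```python
# Hypothetical helpers wrapping steps 4 and 5 of the setup above.
# Assumption: stune_root is the directory where the "schedtune" controller
# is mounted, e.g. /sys/fs/cgroup/stune as in step 3.
import os


def create_boost_group(stune_root, name, boost):
    """Create a boost group and set its schedtune.boost value."""
    if not 0 <= boost <= 100:
        raise ValueError("boost must be in the range [0..100]")
    group = os.path.join(stune_root, name)
    os.makedirs(group, exist_ok=True)      # same as: mkdir <root>/<name>
    with open(os.path.join(group, "schedtune.boost"), "w") as knob:
        knob.write(f"{boost}\n")           # same as: echo <boost> > ...
    return group


def move_task(group, pid):
    """Move the task with the given PID (and its threads) into the group."""
    with open(os.path.join(group, "cgroup.procs"), "w") as procs:
        procs.write(f"{pid}\n")            # same as: echo $TASKPID > ...
```

For example, create_boost_group("/sys/fs/cgroup/stune", "performance", 100)
followed by move_task() on the returned path and a PID mirrors the shell
commands shown above, and needs the same privileges.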


6. Per-task wakeup-placement-strategy Selection
===============================================

Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.

For touch-driven environments, removing additional wakeup latency can be
critical.

When you use the SchedTune CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy-aware task
placement is bypassed for tasks belonging to that group.

prefer_idle=0 (default - use energy-aware task placement if available)
prefer_idle=1 (never use energy-aware task placement for these tasks)

Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency
for the desired tasks whilst still allowing energy-aware wakeup placement
to save energy for other tasks.
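
The effect of the flag can be summarised as a small decision function. This
is a rough model of the rule stated above and not scheduler code: the name
wakeup_placement(), its arguments and the returned labels are all
hypothetical.

```python
# Rough model of how prefer_idle selects the wakeup placement path.
# The function name, arguments and returned labels are hypothetical.
def wakeup_placement(prefer_idle, energy_aware_available):
    """Return which placement path the tasks of a group would take."""
    if prefer_idle not in (0, 1):
        raise ValueError("prefer_idle must be 0 or 1")
    if prefer_idle == 1:
        # Never use energy-aware task placement for these tasks.
        return "regular"
    # Default: use energy-aware task placement if it is available.
    return "energy-aware" if energy_aware_available else "regular"


for flag in (0, 1):
    print(flag, wakeup_placement(flag, energy_aware_available=True))
```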


7. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. The CPU utilization seen by schedutil (and used to select an
appropriate OPP) is boosted with a value which is the maximum of the boost
values of the currently RUNNABLE tasks in its RQ.

This allows cpufreq to boost a CPU only while there are boosted tasks ready
to run and switch back to the energy efficient mode as soon as the last boosted
task is dequeued.
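
The "maximum of the boost values" rule can be modelled with a per-CPU counter
of RUNNABLE tasks for each boost value. The class below is a simplified
user-space sketch, not the kernel data structure: the name CpuBoostTracker and
its methods are hypothetical, and a CPU with no RUNNABLE boosted task is
assumed to fall back to a boost of 0.

```python
# Simplified per-CPU model: the CPU boost is the maximum boost value among
# the currently RUNNABLE tasks. Class and method names are hypothetical.
class CpuBoostTracker:
    def __init__(self):
        # Maps a boost value to the number of RUNNABLE tasks using it.
        self.runnable = {}

    def enqueue(self, boost_pct):
        """Account for a task with the given boost becoming RUNNABLE."""
        self.runnable[boost_pct] = self.runnable.get(boost_pct, 0) + 1

    def dequeue(self, boost_pct):
        """Account for a task with the given boost leaving the RQ."""
        count = self.runnable.get(boost_pct, 0)
        if count == 0:
            raise ValueError("no RUNNABLE task with this boost value")
        if count == 1:
            del self.runnable[boost_pct]
        else:
            self.runnable[boost_pct] = count - 1

    def cpu_boost(self):
        """Maximum boost among RUNNABLE tasks, or 0 when there are none."""
        return max(self.runnable, default=0)


cpu = CpuBoostTracker()
cpu.enqueue(10)
cpu.enqueue(100)
print(cpu.cpu_boost())   # 100 while the 100% boosted task is RUNNABLE
cpu.dequeue(100)
print(cpu.cpu_boost())   # back to 10 once that task is dequeued
```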


8. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] https://lkml.org/lkml/2016/3/29/1041
