1Energy cost model for energy-aware scheduling (EXPERIMENTAL) 2 3Introduction 4============= 5 6The basic energy model uses platform energy data stored in sched_group_energy 7data structures attached to the sched_groups in the sched_domain hierarchy. The 8energy cost model offers two functions that can be used to guide scheduling 9decisions: 10 111. static unsigned int sched_group_energy(struct energy_env *eenv) 122. static int energy_diff(struct energy_env *eenv) 13 14sched_group_energy() estimates the energy consumed by all cpus in a specific 15sched_group including any shared resources owned exclusively by this group of 16cpus. Resources shared with other cpus are excluded (e.g. later level caches). 17 18energy_diff() estimates the total energy impact of a utilization change. That 19is, adding, removing, or migrating utilization (tasks). 20 21Both functions use a struct energy_env to specify the scenario to be evaluated: 22 23 struct energy_env { 24 struct sched_group *sg_top; 25 struct sched_group *sg_cap; 26 int cap_idx; 27 int util_delta; 28 int src_cpu; 29 int dst_cpu; 30 int energy; 31 }; 32 33sg_top: sched_group to be evaluated. Not used by energy_diff(). 34 35sg_cap: sched_group covering the cpus in the same frequency domain. Set by 36sched_group_energy(). 37 38cap_idx: Capacity state to be used for energy calculations. Set by 39find_new_capacity(). 40 41util_delta: Amount of utilization to be added, removed, or migrated. 42 43src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be 44-1 if no source (e.g. task wake-up). 45 46dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1 47if utilization is removed (e.g. terminating tasks). 48 49energy: Result of sched_group_energy(). 50 51The metric used to represent utilization is the actual per-entity running time 52averaged over time using a geometric series. Very similar to the existing 53per-entity load-tracking, but _not_ scaled by task priority and capped by the 54capacity of the cpu. The latter property does mean that utilization may 55underestimate the compute requirements for task on fully/over utilized cpus. 56The greatest potential for energy savings without affecting performance too much 57is scenarios where the system isn't fully utilized. If the system is deemed 58fully utilized load-balancing should be done with task load (includes task 59priority) instead in the interest of fairness and performance. 60 61 62Background and Terminology 63=========================== 64 65To make it clear from the start: 66 67energy = [joule] (resource like a battery on powered devices) 68power = energy/time = [joule/second] = [watt] 69 70The goal of energy-aware scheduling is to minimize energy, while still getting 71the job done. That is, we want to maximize: 72 73 performance [inst/s] 74 -------------------- 75 power [W] 76 77which is equivalent to minimizing: 78 79 energy [J] 80 ----------- 81 instruction 82 83while still getting 'good' performance. It is essentially an alternative 84optimization objective to the current performance-only objective for the 85scheduler. This alternative considers two objectives: energy-efficiency and 86performance. Hence, there needs to be a user controllable knob to switch the 87objective. Since it is early days, this is currently a sched_feature 88(ENERGY_AWARE). 89 90The idea behind introducing an energy cost model is to allow the scheduler to 91evaluate the implications of its decisions rather than applying energy-saving 92techniques blindly that may only have positive effects on some platforms. At 93the same time, the energy cost model must be as simple as possible to minimize 94the scheduler latency impact. 95 96Platform topology 97------------------ 98 99The system topology (cpus, caches, and NUMA information, not peripherals) is 100represented in the scheduler by the sched_domain hierarchy which has 101sched_groups attached at each level that covers one or more cpus (see 102sched-domains.txt for more details). To add energy awareness to the scheduler 103we need to consider power and frequency domains. 104 105Power domain: 106 107A power domain is a part of the system that can be powered on/off 108independently. Power domains are typically organized in a hierarchy where you 109may be able to power down just a cpu or a group of cpus along with any 110associated resources (e.g. shared caches). Powering up a cpu means that all 111power domains it is a part of in the hierarchy must be powered up. Hence, it is 112more expensive to power up the first cpu that belongs to a higher level power 113domain than powering up additional cpus in the same high level domain. Two 114level power domain hierarchy example: 115 116 Power source 117 +-------------------------------+----... 118per group PD G G 119 | +----------+ | 120 +--------+-------| Shared | (other groups) 121per-cpu PD G G | resource | 122 | | +----------+ 123 +-------+ +-------+ 124 | CPU 0 | | CPU 1 | 125 +-------+ +-------+ 126 127Frequency domain: 128 129Frequency domains (P-states) typically cover the same group of cpus as one of 130the power domain levels. That is, there might be several smaller power domains 131sharing the same frequency (P-state) or there might be a power domain spanning 132multiple frequency domains. 133 134From a scheduling point of view there is no need to know the actual frequencies 135[Hz]. All the scheduler cares about is the compute capacity available at the 136current state (P-state) the cpu is in and any other available states. For that 137reason, and to also factor in any cpu micro-architecture differences, compute 138capacity scaling states are called 'capacity states' in this document. For SMP 139systems this is equivalent to P-states. For mixed micro-architecture systems 140(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture 141performance relative to the other cpus in the system. 142 143Energy modelling: 144------------------ 145 146Due to the hierarchical nature of the power domains, the most obvious way to 147model energy costs is therefore to associate power and energy costs with 148domains (groups of cpus). Energy costs of shared resources are associated with 149the group of cpus that share the resources, only the cost of powering the 150cpu itself and any private resources (e.g. private L1 caches) is associated 151with the per-cpu groups (lowest level). 152 153For example, for an SMP system with per-cpu power domains and a cluster level 154(group of cpus) power domain we get the overall energy costs to be: 155 156 energy = energy_cluster + n * energy_cpu 157 158where 'n' is the number of cpus powered up and energy_cluster is the cost paid 159as soon as any cpu in the cluster is powered up. 160 161The power and frequency domains can naturally be mapped onto the existing 162sched_domain hierarchy and sched_groups by adding the necessary data to the 163existing data structures. 164 165The energy model considers energy consumption from two contributors (shown in 166the illustration below): 167 1681. Busy energy: Energy consumed while a cpu and the higher level groups that it 169belongs to are busy running tasks. Busy energy is associated with the state of 170the cpu, not an event. The time the cpu spends in this state varies. Thus, the 171most obvious platform parameter for this contribution is busy power 172(energy/time). 173 1742. Idle energy: Energy consumed while a cpu and higher level groups that it 175belongs to are idle (in a C-state). Like busy energy, idle energy is associated 176with the state of the cpu. Thus, the platform parameter for this contribution 177is idle power (energy/time). 178 179Energy consumed during transitions from an idle-state (C-state) to a busy state 180(P-state) or going the other way is ignored by the model to simplify the energy 181model calculations. 182 183 184 Power 185 ^ 186 | busy->idle idle->busy 187 | transition transition 188 | 189 | _ __ 190 | / \ / \__________________ 191 |______________/ \ / 192 | \ / 193 | Busy \ Idle / Busy 194 | low P-state \____________/ high P-state 195 | 196 +------------------------------------------------------------> time 197 198Busy |--------------| |-----------------| 199 200Wakeup |------| |------| 201 202Idle |------------| 203 204 205The basic algorithm 206==================== 207 208The basic idea is to determine the total energy impact when utilization is 209added or removed by estimating the impact at each level in the sched_domain 210hierarchy starting from the bottom (sched_group contains just a single cpu). 211The energy cost comes from busy time (sched_group is awake because one or more 212cpus are busy) and idle time (in an idle-state). Energy model numbers account 213for energy costs associated with all cpus in the sched_group as a group. 214 215 for_each_domain(cpu, sd) { 216 sg = sched_group_of(cpu) 217 energy_before = curr_util(sg) * busy_power(sg) 218 + (1-curr_util(sg)) * idle_power(sg) 219 energy_after = new_util(sg) * busy_power(sg) 220 + (1-new_util(sg)) * idle_power(sg) 221 energy_diff += energy_before - energy_after 222 223 } 224 225 return energy_diff 226 227{curr, new}_util: The cpu utilization at the lowest level and the overall 228non-idle time for the entire group for higher levels. Utilization is in the 229range 0.0 to 1.0 in the pseudo-code. 230 231busy_power: The power consumption of the sched_group. 232 233idle_power: The power consumption of the sched_group when idle. 234 235Note: It is a fundamental assumption that the utilization is (roughly) scale 236invariant. Task utilization tracking factors in any frequency scaling and 237performance scaling differences due to difference cpu microarchitectures such 238that task utilization can be used across the entire system. 239 240 241Platform energy data 242===================== 243 244struct sched_group_energy can be attached to sched_groups in the sched_domain 245hierarchy and has the following members: 246 247cap_states: 248 List of struct capacity_state representing the supported capacity states 249 (P-states). struct capacity_state has two members: cap and power, which 250 represents the compute capacity and the busy_power of the state. The 251 list must be ordered by capacity low->high. 252 253nr_cap_states: 254 Number of capacity states in cap_states list. 255 256idle_states: 257 List of struct idle_state containing idle_state power cost for each 258 idle-state supported by the system orderd by shallowest state first. 259 All states must be included at all level in the hierarchy, i.e. a 260 sched_group spanning just a single cpu must also include coupled 261 idle-states (cluster states). In addition to the cpuidle idle-states, 262 the list must also contain an entry for the idling using the arch 263 default idle (arch_idle_cpu()). Despite this state may not be a true 264 hardware idle-state it is considered the shallowest idle-state in the 265 energy model and must be the first entry. cpus may enter this state 266 (possibly 'active idling') if cpuidle decides not enter a cpuidle 267 idle-state. Default idle may not be used when cpuidle is enabled. 268 In this case, it should just be a copy of the first cpuidle idle-state. 269 270nr_idle_states: 271 Number of idle states in idle_states list. 272 273There are no unit requirements for the energy cost data. Data can be normalized 274with any reference, however, the normalization must be consistent across all 275energy cost data. That is, one bogo-joule/watt must be the same quantity for 276data, but we don't care what it is. 277 278A recipe for platform characterization 279======================================= 280 281Obtaining the actual model data for a particular platform requires some way of 282measuring power/energy. There isn't a tool to help with this (yet). This 283section provides a recipe for use as reference. It covers the steps used to 284characterize the ARM TC2 development platform. This sort of measurements is 285expected to be done anyway when tuning cpuidle and cpufreq for a given 286platform. 287 288The energy model needs two types of data (struct sched_group_energy holds 289these) for each sched_group where energy costs should be taken into account: 290 2911. Capacity state information 292 293A list containing the compute capacity and power consumption when fully 294utilized attributed to the group as a whole for each available capacity state. 295At the lowest level (group contains just a single cpu) this is the power of the 296cpu alone without including power consumed by resources shared with other cpus. 297It basically needs to fit the basic modelling approach described in "Background 298and Terminology" section: 299 300 energy_system = energy_shared + n * energy_cpu 301 302for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at 303the lowest level. 'energy_shared' is included at the next level which 304represents the group of cpus among which the resources are shared. 305 306This model is, of course, a simplification of reality. Thus, power/energy 307attributions might not always exactly represent how the hardware is designed. 308Also, busy power is likely to depend on the workload. It is therefore 309recommended to use a representative mix of workloads when characterizing the 310capacity states. 311 312If the group has no capacity scaling support, the list will contain a single 313state where power is the busy power attributed to the group. The capacity 314should be set to a default value (1024). 315 316When frequency domains include multiple power domains, the group representing 317the frequency domain and all child groups share capacity states. This must be 318indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at 319all levels that share the capacity state must have the list of capacity states 320with the power set to the contribution of the individual group. 321 3222. Idle power information 323 324Stored in the idle_states list. The power number is the group idle power 325consumption in each idle state as well when the group is idle but has not 326entered an idle-state ('active idle' as mentioned earlier). Due to the way the 327energy model is defined, the idle power of the deepest group idle state can 328alternatively be accounted for in the parent group busy power. In that case the 329group idle state power values are offset such that the idle power of the 330deepest state is zero. It is less intuitive, but it is easier to measure as 331idle power consumed by the group and the busy/idle power of the parent group 332cannot be distinguished without per group measurement points. 333 334Measuring capacity states and idle power: 335 336The capacity states' capacity and power can be estimated by running a benchmark 337workload at each available capacity state. By restricting the benchmark to run 338on subsets of cpus it is possible to extrapolate the power consumption of 339shared resources. 340 341ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a 342shared L2 cache. TC2 has on-chip energy counters per cluster. Running a 343benchmark workload on just one cpu in a cluster means that power is consumed in 344the cluster (higher level group) and a single cpu (lowest level group). Adding 345another benchmark task to another cpu increases the power consumption by the 346amount consumed by the additional cpu. Hence, it is possible to extrapolate the 347cluster busy power. 348 349For platforms that don't have energy counters or equivalent instrumentation 350built-in, it may be possible to use an external DAQ to acquire similar data. 351 352If the benchmark includes some performance score (for example sysbench cpu 353benchmark), this can be used to record the compute capacity. 354 355Measuring idle power requires insight into the idle state implementation on the 356particular platform. Specifically, if the platform has coupled idle-states (or 357package states). To measure non-coupled per-cpu idle-states it is necessary to 358keep one cpu busy to keep any shared resources alive to isolate the idle power 359of the cpu from idle/busy power of the shared resources. The cpu can be tricked 360into different per-cpu idle states by disabling the other states. Based on 361various combinations of measurements with specific cpus busy and disabling 362idle-states it is possible to extrapolate the idle-state power. 363