Documentation/scheduler/sched-energy.txt

Energy cost model for energy-aware scheduling (EXPERIMENTAL)

Introduction
=============

The basic energy model uses platform energy data stored in sched_group_energy
data structures attached to the sched_groups in the sched_domain hierarchy. The
energy cost model offers two functions that can be used to guide scheduling
decisions:

1.	static unsigned int sched_group_energy(struct energy_env *eenv)
2.	static int energy_diff(struct energy_env *eenv)

sched_group_energy() estimates the energy consumed by all cpus in a specific
sched_group including any shared resources owned exclusively by this group of
cpus. Resources shared with other cpus are excluded (e.g. later level caches).

energy_diff() estimates the total energy impact of a utilization change. That
is, adding, removing, or migrating utilization (tasks).

Both functions use a struct energy_env to specify the scenario to be evaluated:

	struct energy_env {
		struct sched_group      *sg_top;
		struct sched_group      *sg_cap;
		int                     cap_idx;
		int                     util_delta;
		int                     src_cpu;
		int                     dst_cpu;
		int                     energy;
	};

sg_top: sched_group to be evaluated. Not used by energy_diff().

sg_cap: sched_group covering the cpus in the same frequency domain. Set by
sched_group_energy().

cap_idx: Capacity state to be used for energy calculations. Set by
find_new_capacity().

util_delta: Amount of utilization to be added, removed, or migrated.

src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be
-1 if no source (e.g. task wake-up).

dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1
if utilization is removed (e.g. terminating tasks).

energy: Result of sched_group_energy().
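
As an illustration of how these fields describe a scenario, the sketch below
fills in a struct energy_env for a task wake-up and for a migration. It is a
user-space mock-up, not scheduler code: the sched_group pointers are simply
left NULL, and the helper names eenv_wakeup() and eenv_migrate() are
hypothetical.

```c
#include <stddef.h>

struct sched_group;	/* opaque here; defined by the scheduler */

/* Same members as the struct energy_env shown above. */
struct energy_env {
	struct sched_group	*sg_top;
	struct sched_group	*sg_cap;
	int			cap_idx;
	int			util_delta;
	int			src_cpu;
	int			dst_cpu;
	int			energy;
};

/* Hypothetical helper: a task with utilization 'util' wakes up on 'cpu'. */
static struct energy_env eenv_wakeup(int util, int cpu)
{
	struct energy_env eenv = {
		.util_delta	= util,
		.src_cpu	= -1,	/* no source cpu on wake-up */
		.dst_cpu	= cpu,
	};

	return eenv;
}

/* Hypothetical helper: 'util' is moved from cpu 'src' to cpu 'dst'. */
static struct energy_env eenv_migrate(int util, int src, int dst)
{
	struct energy_env eenv = {
		.util_delta	= util,
		.src_cpu	= src,
		.dst_cpu	= dst,
	};

	return eenv;
}
```

The members that are not named in the initializers (sg_top, sg_cap, cap_idx
and energy) start out as zero/NULL in this sketch.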

The metric used to represent utilization is the actual per-entity running time
averaged over time using a geometric series. It is very similar to the existing
per-entity load-tracking, but _not_ scaled by task priority and capped by the
capacity of the cpu. The latter property means that utilization may
underestimate the compute requirements of tasks on fully/over utilized cpus.
The greatest potential for energy savings without affecting performance too much
is in scenarios where the system isn't fully utilized. If the system is deemed
fully utilized, load-balancing should instead be done with task load (which
includes task priority) in the interest of fairness and performance.
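
The averaging can be pictured with the following user-space sketch. It is not
the kernel's per-entity load-tracking code: the floating-point arithmetic, the
decay factor (chosen here so that a contribution is halved after 32 periods)
and the names util_update() and UTIL_DECAY are assumptions made for
illustration only.

```c
/* Assumed decay per period: 0.5^(1/32), i.e. half weight after 32 periods. */
#define UTIL_DECAY	0.97857206

/*
 * Fold one period into the running average.  'running' is the fraction of
 * the period the entity actually ran (0.0 - 1.0), 'capacity' is the capacity
 * of the cpu it ran on.  The result is capped by that capacity and is not
 * scaled by task priority.
 */
static double util_update(double util, double running, double capacity)
{
	util = util * UTIL_DECAY + running * capacity * (1.0 - UTIL_DECAY);

	return util > capacity ? capacity : util;
}
```

An entity that runs all the time on a cpu with capacity 1024 converges towards
1024 and never exceeds it, which is why the metric cannot express demand beyond
what the cpu can deliver.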


Background and Terminology
===========================

To make it clear from the start:

energy = [joule] (a resource, like the battery on battery-powered devices)
power = energy/time = [joule/second] = [watt]

The goal of energy-aware scheduling is to minimize energy, while still getting
the job done. That is, we want to maximize:

	performance [inst/s]
	--------------------
	    power [W]

which is equivalent to minimizing:

	energy [J]
	-----------
	instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance. Hence, there needs to be a user controllable knob to switch the
objective. Since it is early days, this is currently a sched_feature
(ENERGY_AWARE).

The idea behind introducing an energy cost model is to allow the scheduler to
evaluate the implications of its decisions rather than applying energy-saving
techniques blindly that may only have positive effects on some platforms. At
the same time, the energy cost model must be as simple as possible to minimize
the scheduler latency impact.

Platform topology
------------------

The system topology (cpus, caches, and NUMA information, not peripherals) is
represented in the scheduler by the sched_domain hierarchy which has
sched_groups attached at each level that covers one or more cpus (see
sched-domains.txt for more details). To add energy awareness to the scheduler
we need to consider power and frequency domains.

Power domain:

A power domain is a part of the system that can be powered on/off
independently. Power domains are typically organized in a hierarchy where you
may be able to power down just a cpu or a group of cpus along with any
associated resources (e.g. shared caches). Powering up a cpu means that all
power domains it is a part of in the hierarchy must be powered up. Hence, it is
more expensive to power up the first cpu that belongs to a higher level power
domain than powering up additional cpus in the same high level domain. Two
level power domain hierarchy example:

		Power source
		         +-------------------------------+----...
per group PD		 G                               G
		         |           +----------+        |
		    +--------+-------| Shared   |  (other groups)
per-cpu PD	    G        G       | resource |
		    |        |       +----------+
		+-------+ +-------+
		| CPU 0 | | CPU 1 |
		+-------+ +-------+

Frequency domain:

Frequency domains (P-states) typically cover the same group of cpus as one of
the power domain levels. That is, there might be several smaller power domains
sharing the same frequency (P-state) or there might be a power domain spanning
multiple frequency domains.

From a scheduling point of view there is no need to know the actual frequencies
[Hz]. All the scheduler cares about is the compute capacity available at the
current state (P-state) the cpu is in and any other available states. For that
reason, and to also factor in any cpu micro-architecture differences, compute
capacity scaling states are called 'capacity states' in this document. For SMP
systems this is equivalent to P-states. For mixed micro-architecture systems
(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
performance relative to the other cpus in the system.
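
One way to picture a capacity state is as a P-state frequency scaled by the
relative per-cycle performance of the cpu. The sketch below assumes a scale
where the fastest cpu type at its highest frequency has capacity 1024. The
function name, its parameters and the scale are assumptions for illustration,
not an interface provided by the scheduler.

```c
/*
 * Hypothetical helper: capacity of a P-state on a 0..1024 scale.
 *
 * freq:     frequency of the P-state (e.g. in MHz)
 * max_freq: highest frequency of any cpu in the system (same unit as freq)
 * perf:     per-cycle performance of this cpu type relative to the fastest
 *           cpu type in the system, where 1024 means 'equal'
 */
static unsigned long capacity_of_pstate(unsigned long freq,
					unsigned long max_freq,
					unsigned long perf)
{
	return freq * perf / max_freq;
}
```

With these assumptions, a cpu with half the per-cycle performance (perf = 512)
running at the highest frequency ends up with the same capacity as the fastest
cpu type running at half that frequency.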

Energy modelling
-----------------

Due to the hierarchical nature of the power domains, the most obvious way to
model energy costs is to associate power and energy costs with domains (groups
of cpus). Energy costs of shared resources are associated with the group of
cpus that share the resources. Only the cost of powering the cpu itself and any
private resources (e.g. private L1 caches) is associated with the per-cpu
groups (lowest level).

For example, for an SMP system with per-cpu power domains and a cluster level
(group of cpus) power domain we get the overall energy costs to be:

	energy = energy_cluster + n * energy_cpu

where 'n' is the number of cpus powered up and energy_cluster is the cost paid
as soon as any cpu in the cluster is powered up.
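
The formula can be written out as a small C sketch. The function name is
hypothetical, and returning zero when no cpu is powered up is an assumption
(the cluster is taken to be powered down completely in that case).

```c
/*
 * energy = energy_cluster + n * energy_cpu
 *
 * The cluster cost is paid once, as soon as at least one cpu is powered up.
 * With no cpu powered up the cluster is assumed to be off as well.
 */
static unsigned int smp_energy(unsigned int energy_cluster,
			       unsigned int energy_cpu, unsigned int n)
{
	if (!n)
		return 0;

	return energy_cluster + n * energy_cpu;
}
```

With energy_cluster = 100 and energy_cpu = 50, powering up the first cpu costs
150 while each additional cpu only adds 50, which is the asymmetry described
for power domains above.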

The power and frequency domains can naturally be mapped onto the existing
sched_domain hierarchy and sched_groups by adding the necessary data to the
existing data structures.

The energy model considers energy consumption from two contributors (shown in
the illustration below):

1. Busy energy: Energy consumed while a cpu and the higher level groups that it
belongs to are busy running tasks. Busy energy is associated with the state of
the cpu, not an event. The time the cpu spends in this state varies. Thus, the
most obvious platform parameter for this contribution is busy power
(energy/time).

2. Idle energy: Energy consumed while a cpu and higher level groups that it
belongs to are idle (in a C-state). Like busy energy, idle energy is associated
with the state of the cpu. Thus, the platform parameter for this contribution
is idle power (energy/time).

Energy consumed during transitions from an idle-state (C-state) to a busy state
(P-state) or going the other way is ignored by the model to simplify the energy
model calculations.


	Power
	^
	|            busy->idle             idle->busy
	|            transition             transition
	|
	|                _                      __
	|               / \                    /  \__________________
	|______________/   \                  /
	|                   \                /
	|  Busy              \    Idle      /        Busy
	|  low P-state        \____________/         high P-state
	|
	+------------------------------------------------------------> time

Busy    |--------------|                          |-----------------|

Wakeup                 |------|            |------|

Idle                          |------------|


The basic algorithm
====================

The basic idea is to determine the total energy impact when utilization is
added or removed by estimating the impact at each level in the sched_domain
hierarchy starting from the bottom (sched_group contains just a single cpu).
The energy cost comes from busy time (sched_group is awake because one or more
cpus are busy) and idle time (in an idle-state). Energy model numbers account
for energy costs associated with all cpus in the sched_group as a group.

	for_each_domain(cpu, sd) {
		sg = sched_group_of(cpu)
		energy_before = curr_util(sg) * busy_power(sg)
				+ (1-curr_util(sg)) * idle_power(sg)
		energy_after = new_util(sg) * busy_power(sg)
				+ (1-new_util(sg)) * idle_power(sg)
		energy_diff += energy_before - energy_after
	}

	return energy_diff

{curr, new}_util: The cpu utilization at the lowest level and the overall
non-idle time for the entire group for higher levels. Utilization is in the
range 0.0 to 1.0 in the pseudo-code.

busy_power: The power consumption of the sched_group.

idle_power: The power consumption of the sched_group when idle.

Note: It is a fundamental assumption that the utilization is (roughly) scale
invariant. Task utilization tracking factors in any frequency scaling and
performance scaling differences due to different cpu micro-architectures such
that task utilization can be used across the entire system.
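
The pseudo-code above can be turned into a compilable user-space sketch as
follows. This is not the scheduler implementation: struct level and the
function names are hypothetical, floating point is used only to match the
0.0 to 1.0 utilization range of the pseudo-code, and the caller is assumed to
supply the levels ordered bottom-up.

```c
#include <stddef.h>

/* Hypothetical description of the group containing the cpu at one level. */
struct level {
	double busy_power;	/* power of the group while busy */
	double idle_power;	/* power of the group while idle */
	double curr_util;	/* utilization before the change, 0.0 - 1.0 */
	double new_util;	/* utilization after the change, 0.0 - 1.0 */
};

/* Energy per unit of time of one group at utilization 'util'. */
static double level_energy(const struct level *l, double util)
{
	return util * l->busy_power + (1.0 - util) * l->idle_power;
}

/*
 * Walk the levels bottom-up and sum the per-level differences, using the
 * same sign convention as the pseudo-code (before - after): a positive
 * result means the new scenario consumes less energy.
 */
static double energy_diff_sketch(const struct level *levels, size_t nr)
{
	double diff = 0.0;
	size_t i;

	for (i = 0; i < nr; i++)
		diff += level_energy(&levels[i], levels[i].curr_util) -
			level_energy(&levels[i], levels[i].new_util);

	return diff;
}
```

For example, raising the utilization of a cpu with busy_power 100 and
idle_power 10 from 0.25 to 0.5, while the non-idle time of its cluster is
unchanged, gives 32.5 - 55 = -22.5, i.e. the change costs energy.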


Platform energy data
=====================

struct sched_group_energy can be attached to sched_groups in the sched_domain
hierarchy and has the following members:

cap_states:
	List of struct capacity_state representing the supported capacity states
	(P-states). struct capacity_state has two members: cap and power, which
	represent the compute capacity and the busy_power of the state. The
	list must be ordered by capacity low->high.

nr_cap_states:
	Number of capacity states in cap_states list.

idle_states:
	List of struct idle_state containing idle_state power cost for each
	idle-state supported by the system, ordered by shallowest state first.
	All states must be included at all levels in the hierarchy, i.e. a
	sched_group spanning just a single cpu must also include coupled
	idle-states (cluster states). In addition to the cpuidle idle-states,
	the list must also contain an entry for idling using the arch
	default idle (arch_idle_cpu()). Although this state may not be a true
	hardware idle-state, it is considered the shallowest idle-state in the
	energy model and must be the first entry. cpus may enter this state
	(possibly 'active idling') if cpuidle decides not to enter a cpuidle
	idle-state. Default idle may not be used when cpuidle is enabled.
	In this case, it should just be a copy of the first cpuidle idle-state.

nr_idle_states:
	Number of idle states in idle_states list.

There are no unit requirements for the energy cost data. Data can be normalized
with any reference, however, the normalization must be consistent across all
energy cost data. That is, one bogo-joule/watt must be the same quantity for
all data, but we don't care what it is.
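
The sketch below shows what such a table could look like for a single group.
The member names follow the descriptions above, but the member types, all
numbers (bogo-units) and the cap_states_ordered() helper are assumptions made
for illustration.

```c
/* Member names follow the descriptions above; the types are assumptions. */
struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* busy power at this capacity */
};

struct idle_state {
	unsigned long power;	/* idle power in this state */
};

struct sched_group_energy {
	unsigned int nr_idle_states;
	struct idle_state *idle_states;
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;
};

/* Made-up numbers, ordered by capacity low->high. */
static struct capacity_state example_cap_states[] = {
	{ .cap =  512, .power = 200 },
	{ .cap =  768, .power = 400 },
	{ .cap = 1024, .power = 800 },
};

/* Made-up numbers, shallowest state first. */
static struct idle_state example_idle_states[] = {
	{ .power = 50 },	/* arch default idle: copy of the next entry */
	{ .power = 50 },	/* first cpuidle idle-state */
	{ .power = 10 },	/* deeper (e.g. coupled) idle-state */
};

static struct sched_group_energy example_energy = {
	.nr_idle_states	= sizeof(example_idle_states) /
			  sizeof(example_idle_states[0]),
	.idle_states	= example_idle_states,
	.nr_cap_states	= sizeof(example_cap_states) /
			  sizeof(example_cap_states[0]),
	.cap_states	= example_cap_states,
};

/* Hypothetical check: returns 1 if cap_states is ordered low->high. */
static int cap_states_ordered(const struct sched_group_energy *sge)
{
	unsigned int i;

	for (i = 1; i < sge->nr_cap_states; i++)
		if (sge->cap_states[i - 1].cap > sge->cap_states[i].cap)
			return 0;

	return 1;
}
```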

A recipe for platform characterization
=======================================

Obtaining the actual model data for a particular platform requires some way of
measuring power/energy. There isn't a tool to help with this (yet). This
section provides a recipe for use as reference. It covers the steps used to
characterize the ARM TC2 development platform. This sort of measurement is
expected to be done anyway when tuning cpuidle and cpufreq for a given
platform.

The energy model needs two types of data (struct sched_group_energy holds
these) for each sched_group where energy costs should be taken into account:

1. Capacity state information

A list containing the compute capacity and power consumption when fully
utilized attributed to the group as a whole for each available capacity state.
At the lowest level (group contains just a single cpu) this is the power of the
cpu alone without including power consumed by resources shared with other cpus.
It basically needs to fit the basic modelling approach described in the
"Background and Terminology" section:

	energy_system = energy_shared + n * energy_cpu

for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
the lowest level. 'energy_shared' is included at the next level which
represents the group of cpus among which the resources are shared.

This model is, of course, a simplification of reality. Thus, power/energy
attributions might not always exactly represent how the hardware is designed.
Also, busy power is likely to depend on the workload. It is therefore
recommended to use a representative mix of workloads when characterizing the
capacity states.

If the group has no capacity scaling support, the list will contain a single
state where power is the busy power attributed to the group. The capacity
should be set to a default value (1024).

When frequency domains include multiple power domains, the group representing
the frequency domain and all child groups share capacity states. This must be
indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
all levels that share the capacity state must have the list of capacity states
with the power set to the contribution of the individual group.

2. Idle power information

Stored in the idle_states list. The power number is the group idle power
consumption in each idle state, as well as when the group is idle but has not
entered an idle-state ('active idle' as mentioned earlier). Due to the way the
energy model is defined, the idle power of the deepest group idle state can
alternatively be accounted for in the parent group busy power. In that case the
group idle state power values are offset such that the idle power of the
deepest state is zero. This is less intuitive, but easier to measure, since the
idle power consumed by the group and the busy/idle power of the parent group
cannot be distinguished without per group measurement points.
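
The offsetting step can be sketched as below. The helper name is hypothetical,
and the sketch assumes that the deepest state is the last entry of the list
(shallowest first) and that it has the lowest power of all entries.

```c
/*
 * Hypothetical helper: offset the idle power values so that the deepest
 * state becomes zero.  The deepest state is assumed to be the last entry
 * and to have the lowest power.  Returns the amount removed from every
 * entry, i.e. the power that is then expected to be accounted for in the
 * parent group busy power instead.
 */
static unsigned long idle_power_offset(unsigned long *power, unsigned int nr)
{
	unsigned long offset;
	unsigned int i;

	if (!nr)
		return 0;

	offset = power[nr - 1];
	for (i = 0; i < nr; i++)
		power[i] -= offset;

	return offset;
}
```

For measured group idle powers of 50, 20 and 5, the list becomes 45, 15 and 0,
and 5 is what remains to be attributed to the parent group.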

Measuring capacity states and idle power:

The capacity states' capacity and power can be estimated by running a benchmark
workload at each available capacity state. By restricting the benchmark to run
on subsets of cpus it is possible to extrapolate the power consumption of
shared resources.

ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
benchmark workload on just one cpu in a cluster means that power is consumed in
the cluster (higher level group) and a single cpu (lowest level group). Adding
another benchmark task to another cpu increases the power consumption by the
amount consumed by the additional cpu. Hence, it is possible to extrapolate the
cluster busy power.
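
The extrapolation amounts to solving two equations. The sketch below assumes
identical cpus within the cluster and that idle cpus contribute nothing to
either measurement. Both function names are hypothetical.

```c
/*
 * p1: cluster power measured with one cpu busy
 * p2: cluster power measured with two cpus busy
 *
 * With p1 = cluster + cpu and p2 = cluster + 2 * cpu it follows that
 * cpu = p2 - p1 and cluster = 2 * p1 - p2.
 */
static unsigned int cpu_busy_power(unsigned int p1, unsigned int p2)
{
	return p2 - p1;
}

static unsigned int cluster_busy_power(unsigned int p1, unsigned int p2)
{
	return 2 * p1 - p2;
}
```

For example, measurements of 350 with one busy cpu and 500 with two busy cpus
give a per-cpu busy power of 150 and a cluster busy power of 200, in whatever
unit the energy counters report.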

For platforms that don't have energy counters or equivalent instrumentation
built-in, it may be possible to use an external DAQ to acquire similar data.

If the benchmark includes some performance score (for example sysbench cpu
benchmark), this can be used to record the compute capacity.

Measuring idle power requires insight into the idle state implementation on the
particular platform, specifically whether the platform has coupled idle-states
(or package states). To measure non-coupled per-cpu idle-states it is necessary
to keep one cpu busy so that any shared resources stay alive, which isolates the
idle power of the cpu from the idle/busy power of the shared resources. The cpu
can be tricked into different per-cpu idle states by disabling the other states.
Based on various combinations of measurements with specific cpus busy and
idle-states disabled, it is possible to extrapolate the idle-state power.