• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# Instruction scheduling
2## Overview
3
4Rearrange adjacent instructions for better performance.
5
6## Rationality
7
8When instructions are executed on CPU they may stall the processor pipeline when input registers are not ready yet, because they are written by one of the previous instructions. Scheduling allows to reduce the amount of such stalls in pipeline.
9
10## Dependence
11
12* Dead Code Elimination(DCE)
13* Remove Empty Blocks
14* Remove Linear blocks
15* Reverse Post Order(RPO)
16
17## Algorithm
18
19Current decisions/limitations:
20 * Scheduler pass is placed immediately before register allocation
21 * It rearranges instructions only inside the basic block, but not between them
22 * No liveness analysis, only calculating dependencies using barrier/users/alias information
23 * No CPU pipeline/resource modeling, only having dependency costs
24 * Forward list scheduling algorithm with standard critical-path-based priority
25
26For each basic block we first scan instructions in reverse order marking barriers and calculating the dependencies.
27Together with dependencies we calculate priority as a longest (critical) path to leaf instructions in basic block dependency graph.
28
29Than we schedule each interval between barriers using the following algorithm.
30There are two priority queues, `waiting` and `ready`. `ready` queue is sorted based on priority calculated previously, while `waiting` queue is based on so-called `ASAP` (as soon as possible) values. In initialization, `ready` is empty and `waiting` contains all leaf instructions (without incoming dependencies), their `ASAP` is 1.
31
32`ASAP` value for each instruction is changed only before it enters the `waiting` queue, and remains unchanged since that time.
33Algorithm starts from tick `cycle` 1. If `ready` queue is empty we look through "soonest" instruction from `waiting` queue and if we need to skip some ticks without scheduling any instruction we have to adjust `cycle` value.
34Next, we move all already available instructions (`ASAP` <= `cycle`) from `waiting` queue into `ready` queue.
35
36Finally, we extract top instruction from `ready` queue and add it into new schedule. At this moment we adjust `ASAP` value for all dependent instructions and add some of them (which depend only on already scheduled instructions) into `waiting` queue.
37
38## Pseudocode
39
40```c++
41Scheduler::RunImpl() {
42    for (auto bb : GetGraph()->GetBlocksRPO())
43        ScheduleBasicBlock(bb);
44}
45
46// Dependency helper function
47void Scheduler::AddDep(uint32_t* prio, Inst* from, Inst* to, uint32_t latency, Inst* barrier) {
48    // Update instruction priority - "how high instruction is in dependency tree"
49    *prio = std::max(*prio, latency + prio_[to]);
50    // Do not add cross-barrier dependencies into deps_
51    if (barrier == nullptr || old_[to] > old_[barrier]) {
52        if (deps_.at(from).count(to) == 1) {
53            uint32_t old_latency = deps_.at(from).at(to);
54            if (old_latency >= latency)
55                return;
56        } else
57            num_deps_[to]++;
58        deps_.at(from)[to] = latency;
59    }
60}
61
62// Rearranges instructions in the basic block using list scheduling algorithm.
63Scheduler::ScheduleBasicBlock(BasicBlock* bb) {
64    // Calculate priority and dependencies
65    uint32_t num_inst = 0;
66    Inst* last_barrier = nullptr;
67
68    for (auto inst : bb->InstsSafeReverse()) {
69        uint32_t prio = 0;
70        old_.insert({inst, num_inst++});
71        num_deps_.insert({inst, 0U});
72        deps_.emplace(inst, GetGraph()->GetLocalAllocator()->Adapter());
73
74        // Dependency to the barrier
75        if (last_barrier != nullptr)
76            AddDep(&prio, inst, last_barrier, 1U, last_barrier);
77        // Dependency from barrier
78        if (barrier) {
79            Inst* old_last_barrier = last_barrier;
80            last_barrier = inst;
81            num_barriers++;
82            for (auto user = inst->GetNext(); user != old_last_barrier; user = user->GetNext())
83                AddDep(&prio, inst, user, 1U, last_barrier);
84        }
85
86        // Users
87        for (auto& user_item : inst->GetUsers()) {
88            auto user = user_item.GetInst();
89            AddDep(&prio, inst, user, inst_latency, last_barrier);
90        }
91
92        .... // Memory dependencies calculation
93        ...  // CanThrow or SaveState can't be rearranged, and stores can't be moved over them
94
95        prio_.insert({inst, prio});
96    }
97
98    // Schedule intervals between barriers
99    uint32_t cycle = 0;
100    num_inst = 0;
101    Inst* first = nullptr;
102    for (auto inst = bb->GetFirstInst(); inst != nullptr; inst = inst->GetNext()) {
103        bool barrier = inst->IsBarrier();
104        num_inst++;
105        if (first == nullptr)
106            first = inst;
107        if (barrier || inst == bb->GetLastInst()) {
108            Inst* last = nullptr;
109            if (barrier) {
110                last = inst->GetPrev();
111                num_inst--;
112            } else
113                last = inst;
114            if (num_inst > 1)
115                cycle += ScheduleInstsBetweenBarriers(first, last);
116            else if (num_inst == 1) {
117                sched_.push_back(first);
118                cycle++;
119            }
120            if (barrier) {
121                sched_.push_back(inst);
122                cycle++;
123            }
124            num_inst = 0;
125            first = nullptr;
126        }
127    }
128    ... // Here we rearrange instructions in basic block according to sched_
129}
130
131// Schedule instructions between [first..last] inclusive, none of them are barriers.
132uint32_t Scheduler::ScheduleInstsBetweenBarriers(Inst* first, Inst* last) {
133    // Compare function for 'waiting' queue
134    auto cmp_asap = [this](Inst* left, Inst* right) {
135        return asap_[left] > asap_[right] || (asap_[left] == asap_[right] && old_[left] < old_[right]);
136    };
137    // Queue of instructions, which dependencies are scheduled already, but they are still not finished yet
138    std::priority_queue<Inst*, ArenaVector<Inst*>, decltype(cmp_asap)> waiting(
139        cmp_asap, GetGraph()->GetLocalAllocator()->Adapter());
140
141    // Compare function for 'ready' queue
142    auto cmp_prio = [this](Inst* left, Inst* right) {
143        return prio_[left] < prio_[right] || (prio_[left] == prio_[right] && old_[left] < old_[right]);
144    };
145    // Queue of ready instructions
146    std::priority_queue<Inst*, ArenaVector<Inst*>, decltype(cmp_prio)> ready(
147        cmp_prio, GetGraph()->GetLocalAllocator()->Adapter());
148
149     // Initialization, add leafs into 'waiting' queue
150    uint32_t num_inst = 0;
151    for (auto inst = first; inst != last->GetNext(); inst = inst->GetNext()) {
152        asap_.insert({inst, 1U});
153        if (num_deps_[inst] == 0)
154            waiting.push(inst);
155        num_inst++;
156    }
157    // Scheduling
158    uint32_t cycle = 1;
159    while (num_inst > 0) {
160        if (ready.empty()) {
161            uint32_t nearest = asap_[waiting.top()];
162            // Skipping cycles where we can't schedule any instruction
163            if (nearest > cycle)
164                cycle = nearest;
165        }
166        // Move from 'waiting' to 'ready'
167        while (!waiting.empty()) {
168            Inst* soonest = waiting.top();
169            if (asap_[soonest] <= cycle) {
170                waiting.pop();
171                ready.push(soonest);
172            } else
173                break;
174        }
175        // Extract top 'ready' instruction
176        auto cur = ready.top();
177        ready.pop();
178        // Adjust all dependent instructions
179        for (auto pair : deps_.at(cur)) {
180            // Adjust asap
181            uint32_t asap = asap_[pair.first];
182            asap = std::max(asap, cycle + pair.second);
183            asap_[pair.first] = asap;
184            // Adjust num_deps
185            uint32_t num_deps = num_deps_[pair.first];
186            num_deps--;
187            num_deps_[pair.first] = num_deps;
188            if (num_deps == 0 && pair.first->GetOpcode() != Opcode::LoadPairPart)
189                waiting.push(pair.first);
190        }
191        // Add into schedule
192        sched_.push_back(cur);
193        num_inst--;
194        cycle++;
195    }
196    asap_.clear();
197    return cycle;
198}
199```
200
201## Examples
202
203IR Before optimization:
204```
205BB 0
206prop: start
207    0.i64  Constant                   0x2a -> (v8)
208    1.i64  Constant                   0x2b -> (v8)
209    2.i64  Constant                   0x2c -> (v9)
210    3.i64  Constant                   0x2d -> (v9)
211    4.i64  Constant                   0x2e -> (v11)
212    5.i64  Constant                   0x2f -> (v11)
213    6.i64  Constant                   0x30 -> (v12)
214    7.i64  Constant                   0x31 -> (v12)
215succs: [bb 2]
216
217BB 2  preds: [bb 0]
218    8.u64  Add                        v0, v1 -> (v10)
219    9.u64  Add                        v2, v3 -> (v10)
220   10.u64  Add                        v8, v9 -> (v14)
221   11.u64  Add                        v4, v5 -> (v13)
222   12.u64  Add                        v6, v7 -> (v13)
223   13.u64  Add                        v11, v12 -> (v14)
224   14.u64  Add                        v10, v13 -> (v15)
225   15.u64  Return                     v14
226succs: [bb 1]
227
228BB 1  preds: [bb 2]
229prop: end
230```
231
232IR after optimization:
233```
234BB 0
235prop: start
236    0.i64  Constant                   0x2a -> (v8)
237    1.i64  Constant                   0x2b -> (v8)
238    2.i64  Constant                   0x2c -> (v9)
239    3.i64  Constant                   0x2d -> (v9)
240    4.i64  Constant                   0x2e -> (v11)
241    5.i64  Constant                   0x2f -> (v11)
242    6.i64  Constant                   0x30 -> (v12)
243    7.i64  Constant                   0x31 -> (v12)
244succs: [bb 2]
245
246BB 2  preds: [bb 0]
247    8.u64  Add                        v0, v1 -> (v10)
248    9.u64  Add                        v2, v3 -> (v10)
249   11.u64  Add                        v4, v5 -> (v13)
250   12.u64  Add                        v6, v7 -> (v13)
251   10.u64  Add                        v8, v9 -> (v14)
252   13.u64  Add                        v11, v12 -> (v14)
253   14.u64  Add                        v10, v13 -> (v15)
254   15.u64  Return                     v14
255succs: [bb 1]
256
257BB 1  preds: [bb 2]
258prop: end
259```
260
261Instruction 10 was moved down.
262
263## Links
264
265Algorithm: [article](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.211.7673&rep=rep1&type=pdf)
266
267Source code:
268[scheduler.cpp](../optimizer/optimizations/scheduler.cpp)
269[scheduler.h](../optimizer/optimizations/scheduler.h)
270
271Tests:
272[scheduler_test.cpp](../tests/scheduler_test.cpp)
273