1# Instruction scheduling 2## Overview 3 4Rearrange adjacent instructions for better performance. 5 6## Rationality 7 8When instructions are executed on CPU they may stall the processor pipeline when input registers are not ready yet, because they are written by one of the previous instructions. Scheduling allows to reduce the amount of such stalls in pipeline. 9 10## Dependence 11 12* Dead Code Elimination(DCE) 13* Remove Empty Blocks 14* Remove Linear blocks 15* Reverse Post Order(RPO) 16 17## Algorithm 18 19Current decisions/limitations: 20 * Scheduler pass is placed immediately before register allocation 21 * It rearranges instructions only inside the basic block, but not between them 22 * No liveness analysis, only calculating dependencies using barrier/users/alias information 23 * No CPU pipeline/resource modeling, only having dependency costs 24 * Forward list scheduling algorithm with standard critical-path-based priority 25 26For each basic block we first scan instructions in reverse order marking barriers and calculating the dependencies. 27Together with dependencies we calculate priority as a longest (critical) path to leaf instructions in basic block dependency graph. 28 29Than we schedule each interval between barriers using the following algorithm. 30There are two priority queues, `waiting` and `ready`. `ready` queue is sorted based on priority calculated previously, while `waiting` queue is based on so-called `ASAP` (as soon as possible) values. In initialization, `ready` is empty and `waiting` contains all leaf instructions (without incoming dependencies), their `ASAP` is 1. 31 32`ASAP` value for each instruction is changed only before it enters the `waiting` queue, and remains unchanged since that time. 33Algorithm starts from tick `cycle` 1. If `ready` queue is empty we look through "soonest" instruction from `waiting` queue and if we need to skip some ticks without scheduling any instruction we have to adjust `cycle` value. 34Next, we move all already available instructions (`ASAP` <= `cycle`) from `waiting` queue into `ready` queue. 35 36Finally, we extract top instruction from `ready` queue and add it into new schedule. At this moment we adjust `ASAP` value for all dependent instructions and add some of them (which depend only on already scheduled instructions) into `waiting` queue. 37 38## Pseudocode 39 40```c++ 41Scheduler::RunImpl() { 42 for (auto bb : GetGraph()->GetBlocksRPO()) 43 ScheduleBasicBlock(bb); 44} 45 46// Dependency helper function 47void Scheduler::AddDep(uint32_t* prio, Inst* from, Inst* to, uint32_t latency, Inst* barrier) { 48 // Update instruction priority - "how high instruction is in dependency tree" 49 *prio = std::max(*prio, latency + prio_[to]); 50 // Do not add cross-barrier dependencies into deps_ 51 if (barrier == nullptr || old_[to] > old_[barrier]) { 52 if (deps_.at(from).count(to) == 1) { 53 uint32_t old_latency = deps_.at(from).at(to); 54 if (old_latency >= latency) 55 return; 56 } else 57 num_deps_[to]++; 58 deps_.at(from)[to] = latency; 59 } 60} 61 62// Rearranges instructions in the basic block using list scheduling algorithm. 63Scheduler::ScheduleBasicBlock(BasicBlock* bb) { 64 // Calculate priority and dependencies 65 uint32_t num_inst = 0; 66 Inst* last_barrier = nullptr; 67 68 for (auto inst : bb->InstsSafeReverse()) { 69 uint32_t prio = 0; 70 old_.insert({inst, num_inst++}); 71 num_deps_.insert({inst, 0U}); 72 deps_.emplace(inst, GetGraph()->GetLocalAllocator()->Adapter()); 73 74 // Dependency to the barrier 75 if (last_barrier != nullptr) 76 AddDep(&prio, inst, last_barrier, 1U, last_barrier); 77 // Dependency from barrier 78 if (barrier) { 79 Inst* old_last_barrier = last_barrier; 80 last_barrier = inst; 81 num_barriers++; 82 for (auto user = inst->GetNext(); user != old_last_barrier; user = user->GetNext()) 83 AddDep(&prio, inst, user, 1U, last_barrier); 84 } 85 86 // Users 87 for (auto& user_item : inst->GetUsers()) { 88 auto user = user_item.GetInst(); 89 AddDep(&prio, inst, user, inst_latency, last_barrier); 90 } 91 92 .... // Memory dependencies calculation 93 ... // CanThrow or SaveState can't be rearranged, and stores can't be moved over them 94 95 prio_.insert({inst, prio}); 96 } 97 98 // Schedule intervals between barriers 99 uint32_t cycle = 0; 100 num_inst = 0; 101 Inst* first = nullptr; 102 for (auto inst = bb->GetFirstInst(); inst != nullptr; inst = inst->GetNext()) { 103 bool barrier = inst->IsBarrier(); 104 num_inst++; 105 if (first == nullptr) 106 first = inst; 107 if (barrier || inst == bb->GetLastInst()) { 108 Inst* last = nullptr; 109 if (barrier) { 110 last = inst->GetPrev(); 111 num_inst--; 112 } else 113 last = inst; 114 if (num_inst > 1) 115 cycle += ScheduleInstsBetweenBarriers(first, last); 116 else if (num_inst == 1) { 117 sched_.push_back(first); 118 cycle++; 119 } 120 if (barrier) { 121 sched_.push_back(inst); 122 cycle++; 123 } 124 num_inst = 0; 125 first = nullptr; 126 } 127 } 128 ... // Here we rearrange instructions in basic block according to sched_ 129} 130 131// Schedule instructions between [first..last] inclusive, none of them are barriers. 132uint32_t Scheduler::ScheduleInstsBetweenBarriers(Inst* first, Inst* last) { 133 // Compare function for 'waiting' queue 134 auto cmp_asap = [this](Inst* left, Inst* right) { 135 return asap_[left] > asap_[right] || (asap_[left] == asap_[right] && old_[left] < old_[right]); 136 }; 137 // Queue of instructions, which dependencies are scheduled already, but they are still not finished yet 138 std::priority_queue<Inst*, ArenaVector<Inst*>, decltype(cmp_asap)> waiting( 139 cmp_asap, GetGraph()->GetLocalAllocator()->Adapter()); 140 141 // Compare function for 'ready' queue 142 auto cmp_prio = [this](Inst* left, Inst* right) { 143 return prio_[left] < prio_[right] || (prio_[left] == prio_[right] && old_[left] < old_[right]); 144 }; 145 // Queue of ready instructions 146 std::priority_queue<Inst*, ArenaVector<Inst*>, decltype(cmp_prio)> ready( 147 cmp_prio, GetGraph()->GetLocalAllocator()->Adapter()); 148 149 // Initialization, add leafs into 'waiting' queue 150 uint32_t num_inst = 0; 151 for (auto inst = first; inst != last->GetNext(); inst = inst->GetNext()) { 152 asap_.insert({inst, 1U}); 153 if (num_deps_[inst] == 0) 154 waiting.push(inst); 155 num_inst++; 156 } 157 // Scheduling 158 uint32_t cycle = 1; 159 while (num_inst > 0) { 160 if (ready.empty()) { 161 uint32_t nearest = asap_[waiting.top()]; 162 // Skipping cycles where we can't schedule any instruction 163 if (nearest > cycle) 164 cycle = nearest; 165 } 166 // Move from 'waiting' to 'ready' 167 while (!waiting.empty()) { 168 Inst* soonest = waiting.top(); 169 if (asap_[soonest] <= cycle) { 170 waiting.pop(); 171 ready.push(soonest); 172 } else 173 break; 174 } 175 // Extract top 'ready' instruction 176 auto cur = ready.top(); 177 ready.pop(); 178 // Adjust all dependent instructions 179 for (auto pair : deps_.at(cur)) { 180 // Adjust asap 181 uint32_t asap = asap_[pair.first]; 182 asap = std::max(asap, cycle + pair.second); 183 asap_[pair.first] = asap; 184 // Adjust num_deps 185 uint32_t num_deps = num_deps_[pair.first]; 186 num_deps--; 187 num_deps_[pair.first] = num_deps; 188 if (num_deps == 0 && pair.first->GetOpcode() != Opcode::LoadPairPart) 189 waiting.push(pair.first); 190 } 191 // Add into schedule 192 sched_.push_back(cur); 193 num_inst--; 194 cycle++; 195 } 196 asap_.clear(); 197 return cycle; 198} 199``` 200 201## Examples 202 203IR Before optimization: 204``` 205BB 0 206prop: start 207 0.i64 Constant 0x2a -> (v8) 208 1.i64 Constant 0x2b -> (v8) 209 2.i64 Constant 0x2c -> (v9) 210 3.i64 Constant 0x2d -> (v9) 211 4.i64 Constant 0x2e -> (v11) 212 5.i64 Constant 0x2f -> (v11) 213 6.i64 Constant 0x30 -> (v12) 214 7.i64 Constant 0x31 -> (v12) 215succs: [bb 2] 216 217BB 2 preds: [bb 0] 218 8.u64 Add v0, v1 -> (v10) 219 9.u64 Add v2, v3 -> (v10) 220 10.u64 Add v8, v9 -> (v14) 221 11.u64 Add v4, v5 -> (v13) 222 12.u64 Add v6, v7 -> (v13) 223 13.u64 Add v11, v12 -> (v14) 224 14.u64 Add v10, v13 -> (v15) 225 15.u64 Return v14 226succs: [bb 1] 227 228BB 1 preds: [bb 2] 229prop: end 230``` 231 232IR after optimization: 233``` 234BB 0 235prop: start 236 0.i64 Constant 0x2a -> (v8) 237 1.i64 Constant 0x2b -> (v8) 238 2.i64 Constant 0x2c -> (v9) 239 3.i64 Constant 0x2d -> (v9) 240 4.i64 Constant 0x2e -> (v11) 241 5.i64 Constant 0x2f -> (v11) 242 6.i64 Constant 0x30 -> (v12) 243 7.i64 Constant 0x31 -> (v12) 244succs: [bb 2] 245 246BB 2 preds: [bb 0] 247 8.u64 Add v0, v1 -> (v10) 248 9.u64 Add v2, v3 -> (v10) 249 11.u64 Add v4, v5 -> (v13) 250 12.u64 Add v6, v7 -> (v13) 251 10.u64 Add v8, v9 -> (v14) 252 13.u64 Add v11, v12 -> (v14) 253 14.u64 Add v10, v13 -> (v15) 254 15.u64 Return v14 255succs: [bb 1] 256 257BB 1 preds: [bb 2] 258prop: end 259``` 260 261Instruction 10 was moved down. 262 263## Links 264 265Algorithm: [article](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.211.7673&rep=rep1&type=pdf) 266 267Source code: 268[scheduler.cpp](../optimizer/optimizations/scheduler.cpp) 269[scheduler.h](../optimizer/optimizations/scheduler.h) 270 271Tests: 272[scheduler_test.cpp](../tests/scheduler_test.cpp) 273