1# Load Store Elimination 2## Overview 3 4The idea of optimization is to delete store instructions that store a value to memory that has been already written as well as delete load instructions that attempt to load a value that has been loaded earlier. 5 6## Rationality 7 8Elimination of load and store instructions generally reduces the number of long latency memory instructions. It should be stated that the optimization increases the register pressure and as a result can increase the amount of required spills/fills. 9 10## Dependence 11 12* AliasAnalysis 13* LoopAnalysis 14* Reverse Post Order (RPO) 15 16## Algorithm 17 18Algorithm needs to know that two memory instructions access the same memory address. This can be done using alias analysis that accepts two memory instructions and is able to say about these accesses the following 19* `MUST_ALIAS` if the instructions definitely access the same memory address 20* `NO_ALIAS` if the instructions definitely access different memory addresses. 21* `MAY_ALIAS` if analysis can't say with confidence whether the instructions access the same memory address or different (or probably they may access the same address only under some conditions but under others – not) 22 23The main structure for the algorithm is a heap representation in a form of 24 25`"Memory Instruction" -> "Value stored at location pointed by instruction"` 26 27The heap is a representation of the real heap during execution. Each instruction that works with memory operates with a particular memory address. E.g. store instructions write values on the heap, load instructions read values from the heap. But if a load instruction tries to read a value that was written by store instruction above, this load can be eliminated because we already know the value without actual reading. The heap structure helps to keep track of values that memory instruction operates with (store instruction -- write, load instructions -- read). 28 29Each basic block has its own heap. The initial value of a heap for a basic block is a merged heap from its predecessors. The result of heap merge is an intersected set of heap values on the same addresses. The initial value for entry basic block is empty heap. The empty heap is an initial value for loop headers as well because back edges can rewrite the heap values. 30 31A dependence on predecessors needs us to traverse basic blocks in reverse post order so that for each basic block we already visited all it's predecessors (empty heap of one of predecessor is a back edge). 32 33Once a heap is initialized for a basic block we iterate over instructions and update heap by applying the following rules to current instruction: 34 35- if the instruction is a store and a stored value is equal to value from heap for this store then this store can be eliminated. 36- if the instruction is a store and a value from heap for this store is absent or differs from a new stored value then the new stored value is written into heap. The values of memory instructions that `MUST_ALIAS` this store are updated as well. All values in the heap that `MAY_ALIAS` this store instruction are invalidated. 37- if the instruction is a load and there is a value from the heap for this load then this load can be eliminated. 38- if the instruction is a load and there is no value from the heap for this load then we update heap value for this load with the result of this load. All instructions that `MUST_ALIAS` this load updated as well. 39- If the instruction can invoke GC then all references on the heap that aren't mentioned in corresponded SaveState should be invalidated. 40- if the instruction is a volatile load then the whole heap is cleared. 41- if the instruction is a call then the whole heap is cleared. 42 43All instructions that can be eliminated are recorded in a separate list for eliminated instructions and erase them at the end of optimization. 44 45Loops have back edges by this reason we cannot eliminate anything in single traversal because further memory operations may update heap using back edge. Therefore an initial heap for loop header is empty. But still there is a case when we can use collected heap of loop preheader (e.g. if we can guarantee that memory access is read-only inside loop). 46 47The typical example is the following (in pandasm): 48 49``` 50.record T { 51 i32[] array <static> 52} 53 54.function i32 T.foo() { 55 movi v2, 0 #res 56 movi v0, 0 #i 57 ldobj.obj v1, T.array 58 ldarr v1 59 sta v3 60loop: 61 jge v3, loop_exit 62 lda v0 63 ldobj.obj v1, T.array 64 ldarr v1 65 add2 v2 66 sta v2 67 inci v0, 1 68 lda v0 69 jmp loop 70loop_exit: 71 lda v2 72 return 73} 74``` 75 76It will be translated in the following code: 77 78``` 79BB 2 preds: [bb 0] 80 3.ref LoadObject 242 v0 -> (v6) 81 6.i32 LenArray v3 -> (v30, v16) 82 30.i32 Cmp v7, v6 -> (v31) 83 31.b Compare GE i32 v30, v7 -> (v32) 84 32. IfImm NE b v31, 0x0 85succs: [bb 3, bb 4] 86 87BB 4 preds: [bb 2, bb 4] 88prop: head, loop 1 89 10p.i32 Phi v7(bb2), v27(bb4) -> (v25, v27) 90 11p.i32 Phi v7(bb2), v26(bb4) -> (v26) 91 20.ref LoadObject 242 v0 -> (v25) 92 25.i32 LoadArray v20, v10p -> (v26) 93 26.i32 Add v25, v11p -> (v11p, v35p) 94 27.i32 Add v10p, v28 -> (v10p, v16) 95 16.b Compare GE i32 v27, v6 -> (v17) 96 17. IfImm NE b v16, 0x0 97succs: [bb 3, bb 4] 98 99BB 3 preds: [bb 2, bb 4] 100 35p.i32 Phi v7(bb2), v26(bb4) -> (v29) 101 29.i32 Return v35p 102succs: [bb 1] 103``` 104 105Here is `v20` inside loop that can be eliminated due to `v3`. However we can't record `v20` to list of loads that will be eliminated because further in the loop we can have a store instruction like the following: 106 107``` 108v28.ref StoreObject 242 v0, v40 109``` 110 111To handle this, each loop has its list of memory accesses that was loaded before loop (heap of preheader). We call these memory accesses as phi-candidates. Each phi-candidate has its own list of aliased memory accesses that will be collected further. While traversing a loop we check aliasing of each memory access inside the loop with phi-candidates. If a memory access is `MAY_ALIAS`ed or `MUST_ALIAS`ed we put this memory access to the corresponding lists of phi-candidates. Irreducible loops do not have preheader so we ignore them for optimization. 112 113Finally, we iterate over all memory accesses collected in lists for each phi-candidate. If there is at least one store among them, we drop this phi-candidate because it is overwritten somewhere inside the loop. If there are only load instructions among them, then we can replace memory accesses inside the loop by values obtained from outside the loop (replace only `MUST_ALIAS`ed accesses). 114 115The following rules are added to the list above to handle loops: 116 117- visiting any memory access inside a reducible loop we check the aliasing with phi-candidates of this loop and all outer loop and record aliased ones to a corresponding candidate. 118- all phi-candidates of a loop (and of all outer loops of this loop) are invalidated if any of instructions that clear heap have been met in this loop. 119 120After iterating the whole graph the following actions are done 121- iterate over phi-candidates with aliased accesses. 122 * If any of aliased accesses for a candidate is a store, we do nothing. 123 * If among aliased accesses only loads, add `MUST_ALIAS`ed load to elimination list. 124- erase all instructions from elimination list 125 126## Pseudocode 127 128``` 129bool LSE::RunImpl() { 130 Heap heap; 131 PhiCands phis; 132 133 LSEVisitor visitor(GetGraph(), &heap, &phis); 134 for (auto block : GetGraph()->GetBlocksRPO()) { 135 if (block->IsLoopHeader()) { 136 // Put heap[block->GetLoop()->GetPreheader()] values to phis[block->GetLoop()] 137 // heap[block] is empty 138 MergeHeapValuesForLoop(block, &heap, &phis); 139 } else { 140 // Put in heap[block] values that are the same in block's predecessors 141 MergeHeapValuesForBlock(block, &heap, &mustrix); 142 } 143 144 for (auto inst : block->Insts()) { 145 if (IsHeapInvalidatingInst(inst)) { 146 heap[block].clear(); 147 auto loop = block->GetLoop(); 148 while (!loop->IsRoot()) { 149 phis.at(block->GetLoop()).clear(); 150 loop = loop->GetOuterLoop(); 151 } 152 } else if (IsGCInst(inst)) { 153 SaveStateInst *ss = inst->GetSaveState(); 154 if (inst->GetOpcode() == Opcode::SafePoint) { 155 ss = inst->CastToSafePoint(); 156 } 157 // Invalidate references not mentioned in SaveState 158 visitor.InvalidateRefs(block, ss); 159 } else if (inst->IsLoad()) { 160 visitor.VisitLoad(inst); 161 } else if (inst->IsStore()) { 162 visitor.VisitStore(inst, inst->GetStoredValue()); 163 } 164 } 165 } 166 // Append instructions, that can be eliminated due to loop preheader heap values, to elimination list 167 visitor.FinalizeLoops(); 168 169 // Erase collected instructions 170 for (auto elim : visitor.GetEliminations()) { 171 DeleteInstruction(elim.inst, elim.value); // Replace elim.inst with elim.value 172 } 173} 174 175void LSEVisitor::VisitStore(Inst *inst, Inst *val) { 176 BasicBlock *block = inst->GetBasicBlock(); 177 auto &block_heap = heap_.at(block); 178 // If a value stored on the heap is already there, we eliminate this store instruction 179 auto mem = HeapHasEqaulValue(inst, val); 180 if (mem != nullptr && LSE::CanEliminateInstruction(inst)) { 181 eliminations_[inst] = block_heap[mem]; 182 return; 183 } 184 185 // If this store MUST_ALIAS any inst from phis[block], it prohibits any replacements of this phi candidate 186 UpdatePhis(inst); 187 188 // Erase all aliased values, because they may be overwritten 189 for (auto heap_iter : block_heap) { 190 if (inst.CheckAlias(heap_iter) != NO_ALIAS) { 191 block_heap.erase(heap_iter); 192 } 193 } 194 195 // Record a stored value to heap[block] 196 block_heap[inst] = {inst, val}; 197} 198 199void LSEVisitor::VisitLoad(Inst *inst) { 200 BasicBlock *block = inst->GetBasicBlock(); 201 auto &block_heap = heap_.at(block); 202 // We already know the value on the heap -> eliminate inst 203 auto mem = HeapHasValue(inst); 204 if (mem != nullptr && LSE::CanEliminateInstruction(inst)) 205 eliminations_[inst] = block_heap[mem]; // Store the value to replace instruction with it 206 return; 207 } 208 209 // If this load MUST_ALIAS any inst from phis[block] it can be further replaced with value outside the loop 210 UpdatePhis(inst); 211 212 // Record loaded value to heap[block] and update all MUST_ALIASes 213 block_heap[inst] = {inst, inst}; 214} 215``` 216 217## Examples 218### Loading stored value in `if` block 219Before: 220``` 221BB 2 preds: [bb 0] 222 7.i32 StoreArray v0, v2, v1 223 9.b Compare GT i32 v1, v8 -> (v10) 224 10. IfImm NE b v9, 0x0 225succs: [bb 3, bb 4] 226 227BB 4 preds: [bb 2] 228 15.i32 LoadArray v0, v2 -> (v21) 229 20.i32 LoadArray v0, v2 -> (v21) 230 21.i32 Add v15, v20 -> (v22p) 231succs: [bb 3] 232 233BB 3 preds: [bb 2, bb 4] 234 22p.i32 Phi v1(bb2), v21(bb4) -> (v23) 235 23.i32 Return v22p 236succs: [bb 1] 237``` 238Here we can see that instruction `v15` and `v20` load a values stored by `v7`. As a result we can substitute these loads with stored value. 239 240After: 241``` 242BB 2 preds: [bb 0] 243 7.i32 StoreArray v0, v2, v1 244 9.b Compare GT i32 v1, v8 -> (v10) 245 10. IfImm NE b v9, 0x0 246succs: [bb 3, bb 4] 247 248BB 4 preds: [bb 2] 249 21.i32 Add v1, v1 -> (v22p) 250succs: [bb 3] 251 252BB 3 preds: [bb 2, bb 4] 253 22p.i32 Phi v1(bb2), v21(bb4) -> (v23) 254 23.i32 Return v22p 255succs: [bb 1] 256``` 257### Object access elimination in loop 258In this example we load array from object and sum its elements. 259``` 260BB 2 preds: [bb 0] 261 3.ref LoadObject 242 v0 -> (v6) 262 6.i32 LenArray v3 -> (v30, v16) 263 30.i32 Cmp v7, v6 -> (v31) 264 31.b Compare GE i32 v30, v7 -> (v32) 265 32. IfImm NE b v31, 0x0 266succs: [bb 3, bb 4] 267 268BB 4 preds: [bb 2, bb 4] 269prop: head, loop 1 270 10p.i32 Phi v7(bb2), v27(bb4) -> (v25, v27) 271 11p.i32 Phi v7(bb2), v26(bb4) -> (v26) 272 20.ref LoadObject 242 v0 -> (v25) 273 25.i32 LoadArray v20, v10p -> (v26) 274 26.i32 Add v25, v11p -> (v11p, v35p) 275 27.i32 Add v10p, v28 -> (v10p, v16) 276 16.b Compare GE i32 v27, v6 -> (v17) 277 17. IfImm NE b v16, 0x0 278succs: [bb 3, bb 4] 279``` 280Until the `LoadObject` accesses a volatile field, we can eliminate `v20` inside the loop and use `v3` instead. 281``` 282BB 2 preds: [bb 0] 283 3.ref LoadObject 242 v0 -> (v6, v25) 284 6.i32 LenArray v3 -> (v30, v16) 285 30.i32 Cmp v7, v6 -> (v31) 286 31.b Compare GE i32 v30, v7 -> (v32) 287 32. IfImm NE b v31, 0x0 288succs: [bb 3, bb 4] 289 290BB 4 preds: [bb 2, bb 4] 291prop: head, loop 1 292 10p.i32 Phi v7(bb2), v27(bb4) -> (v25, v27) 293 11p.i32 Phi v7(bb2), v26(bb4) -> (v26) 294 25.i32 LoadArray v3, v10p -> (v26) 295 26.i32 Add v25, v11p -> (v11p, v35p) 296 27.i32 Add v10p, v28 -> (v10p, v16) 297 16.b Compare GE i32 v27, v6 -> (v17) 298 17. IfImm NE b v16, 0x0 299succs: [bb 3, bb 4] 300``` 301## Links 302 303Source code: 304[lse.cpp](../optimizer/optimizations/lse.cpp) 305[lse.h](../optimizer/optimizations/lse.h) 306 307Tests: 308[lse_test.cpp](../tests/lse_test.cpp) 309