• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# Load Store Elimination
2## Overview
3
4The idea of optimization is to delete store instructions that store a value to memory that has been already written as well as delete load instructions that attempt to load a value that has been loaded earlier.
5
6## Rationality
7
8Elimination of load and store instructions generally reduces the number of long latency memory instructions. It should be stated that the optimization increases the register pressure and as a result can increase the amount of required spills/fills.
9
10## Dependence
11
12* AliasAnalysis
13* LoopAnalysis
14* Reverse Post Order (RPO)
15
16## Algorithm
17
18Algorithm needs to know that two memory instructions access the same memory address.  This can be done using alias analysis that accepts two memory instructions and is able to say about these accesses the following
19*  `MUST_ALIAS` if the instructions definitely access the same memory address
20*  `NO_ALIAS` if the instructions definitely access different memory addresses.
21*  `MAY_ALIAS` if analysis can't say with confidence whether the instructions access the same memory address or different (or probably they may access the same address only under some conditions but under others – not)
22
23The main structure for the algorithm is a heap representation in a form of
24
25`"Memory Instruction" -> "Value stored at location pointed by instruction"`
26
27The heap is a representation of the real heap during execution. Each instruction that works with memory operates with a particular memory address. E.g. store instructions write values on the heap, load instructions read values from the heap. But if a load instruction tries to read a value that was written by store instruction above, this load can be eliminated because we already know the value without actual reading. The heap structure helps to keep track of values that memory instruction operates with (store instruction -- write, load instructions -- read).
28
29Each basic block has its own heap. The initial value of a heap for a basic block is a merged heap from its predecessors. The result of heap merge is an intersected set of heap values on the same addresses. The initial value for entry basic block is empty heap. The empty heap is an initial value for loop headers as well because back edges can rewrite the heap values.
30
31A dependence on predecessors needs us to traverse basic blocks in reverse post order so that for each basic block we already visited all it's predecessors (empty heap of one of predecessor is a back edge).
32
33Once a heap is initialized for a basic block we iterate over instructions and update heap by applying the following rules to current instruction:
34
35- if the instruction is a store and a stored value is equal to value from heap for this store then this store can be eliminated.
36- if the instruction is a store and a value from heap for this store is absent or differs from a new stored value then the new stored value is written into heap.  The values of memory instructions that `MUST_ALIAS` this store are updated as well.  All values in the heap that `MAY_ALIAS` this store instruction are invalidated.
37- if the instruction is a load and there is a value from the heap for this load then this load can be eliminated.
38- if the instruction is a load and there is no value from the heap for this load then we update heap value for this load with the result of this load.  All instructions that `MUST_ALIAS` this load updated as well.
39- If the instruction can invoke GC then all references on the heap that aren't mentioned in corresponded SaveState should be invalidated.
40- if the instruction is a volatile load then the whole heap is cleared.
41- if the instruction is a call then the whole heap is cleared.
42
43All instructions that can be eliminated are recorded in a separate list for eliminated instructions and erase them at the end of optimization.
44
45Loops have back edges by this reason we cannot eliminate anything in single traversal because further memory operations may update heap using back edge. Therefore an initial heap for loop header is empty. But still there is a case when we can use collected heap of loop preheader (e.g. if we can guarantee that memory access is read-only inside loop).
46
47The typical example is the following (in pandasm):
48
49```
50.record T {
51    i32[] array <static>
52}
53
54.function i32 T.foo() {
55    movi v2, 0       #res
56    movi v0, 0       #i
57    ldobj.obj v1, T.array
58    ldarr v1
59    sta v3
60loop:
61    jge v3, loop_exit
62    lda v0
63    ldobj.obj v1, T.array
64    ldarr v1
65    add2 v2
66    sta v2
67    inci v0, 1
68    lda v0
69    jmp loop
70loop_exit:
71    lda v2
72    return
73}
74```
75
76It will be translated in the following code:
77
78```
79BB 2  preds: [bb 0]
80    3.ref  LoadObject 242             v0 -> (v6)
81    6.i32  LenArray                   v3 -> (v30, v16)
82   30.i32  Cmp                        v7, v6 -> (v31)
83   31.b    Compare GE i32             v30, v7 -> (v32)
84   32.     IfImm NE b                 v31, 0x0
85succs: [bb 3, bb 4]
86
87BB 4  preds: [bb 2, bb 4]
88prop: head, loop 1
89  10p.i32  Phi                        v7(bb2), v27(bb4) -> (v25, v27)
90  11p.i32  Phi                        v7(bb2), v26(bb4) -> (v26)
91   20.ref  LoadObject 242             v0 -> (v25)
92   25.i32  LoadArray                  v20, v10p -> (v26)
93   26.i32  Add                        v25, v11p -> (v11p, v35p)
94   27.i32  Add                        v10p, v28 -> (v10p, v16)
95   16.b    Compare GE i32             v27, v6 -> (v17)
96   17.     IfImm NE b                 v16, 0x0
97succs: [bb 3, bb 4]
98
99BB 3  preds: [bb 2, bb 4]
100  35p.i32  Phi                        v7(bb2), v26(bb4) -> (v29)
101   29.i32  Return                     v35p
102succs: [bb 1]
103```
104
105Here is `v20` inside loop that can be eliminated due to `v3`. However we can't record `v20` to list of loads that will be eliminated because further in the loop we can have a store instruction like the following:
106
107```
108v28.ref StoreObject 242 v0, v40
109```
110
111To handle this, each loop has its list of memory accesses that was loaded before loop (heap of preheader). We call these memory accesses as phi-candidates. Each phi-candidate has its own list of aliased memory accesses that will be collected further. While traversing a loop we check aliasing of each memory access inside the loop with phi-candidates.  If a memory access is `MAY_ALIAS`ed or `MUST_ALIAS`ed we put this memory access to the corresponding lists of phi-candidates.  Irreducible loops do not have preheader so we ignore them for optimization.
112
113Finally, we iterate over all memory accesses collected in lists for each phi-candidate. If there is at least one store among them, we drop this phi-candidate because it is overwritten somewhere inside the loop. If there are only load instructions among them, then we can replace memory accesses inside the loop by values obtained from outside the loop (replace only `MUST_ALIAS`ed accesses).
114
115The following rules are added to the list above to handle loops:
116
117- visiting any memory access inside a reducible loop we check the aliasing with phi-candidates of this loop and all outer loop and record aliased ones to a corresponding candidate.
118- all phi-candidates of a loop (and of all outer loops of this loop) are invalidated if any of instructions that clear heap have been met in this loop.
119
120After iterating the whole graph the following actions are done
121- iterate over phi-candidates with aliased accesses.
122  * If any of aliased accesses for a candidate is a store, we do nothing.
123  * If among aliased accesses only loads, add `MUST_ALIAS`ed load to elimination list.
124- erase all instructions from elimination list
125
126## Pseudocode
127
128```
129bool LSE::RunImpl() {
130    Heap heap;
131    PhiCands phis;
132
133    LSEVisitor visitor(GetGraph(), &heap, &phis);
134    for (auto block : GetGraph()->GetBlocksRPO()) {
135        if (block->IsLoopHeader()) {
136            // Put heap[block->GetLoop()->GetPreheader()] values to phis[block->GetLoop()]
137            // heap[block] is empty
138            MergeHeapValuesForLoop(block, &heap, &phis);
139        } else {
140            // Put in heap[block] values that are the same in block's predecessors
141            MergeHeapValuesForBlock(block, &heap, &mustrix);
142        }
143
144        for (auto inst : block->Insts()) {
145            if (IsHeapInvalidatingInst(inst)) {
146                heap[block].clear();
147                auto loop = block->GetLoop();
148                while (!loop->IsRoot()) {
149                    phis.at(block->GetLoop()).clear();
150                    loop = loop->GetOuterLoop();
151                }
152            } else if (IsGCInst(inst)) {
153                SaveStateInst *ss = inst->GetSaveState();
154                if (inst->GetOpcode() == Opcode::SafePoint) {
155                    ss = inst->CastToSafePoint();
156                }
157                // Invalidate references not mentioned in SaveState
158                visitor.InvalidateRefs(block, ss);
159            } else if (inst->IsLoad()) {
160                visitor.VisitLoad(inst);
161            } else if (inst->IsStore()) {
162                visitor.VisitStore(inst, inst->GetStoredValue());
163            }
164        }
165    }
166    // Append instructions, that can be eliminated due to loop preheader heap values, to elimination list
167    visitor.FinalizeLoops();
168
169    // Erase collected instructions
170    for (auto elim : visitor.GetEliminations()) {
171        DeleteInstruction(elim.inst, elim.value); // Replace elim.inst with elim.value
172    }
173}
174
175void LSEVisitor::VisitStore(Inst *inst, Inst *val) {
176    BasicBlock *block = inst->GetBasicBlock();
177    auto &block_heap = heap_.at(block);
178    // If a value stored on the heap is already there, we eliminate this store instruction
179    auto mem = HeapHasEqaulValue(inst, val);
180    if (mem != nullptr && LSE::CanEliminateInstruction(inst)) {
181        eliminations_[inst] = block_heap[mem];
182        return;
183    }
184
185    // If this store MUST_ALIAS any inst from phis[block], it prohibits any replacements of this phi candidate
186    UpdatePhis(inst);
187
188    // Erase all aliased values, because they may be overwritten
189    for (auto heap_iter : block_heap) {
190        if (inst.CheckAlias(heap_iter) != NO_ALIAS) {
191          block_heap.erase(heap_iter);
192        }
193    }
194
195    // Record a stored value to heap[block]
196    block_heap[inst] = {inst, val};
197}
198
199void LSEVisitor::VisitLoad(Inst *inst) {
200    BasicBlock *block = inst->GetBasicBlock();
201    auto &block_heap = heap_.at(block);
202    // We already know the value on the heap -> eliminate inst
203    auto mem = HeapHasValue(inst);
204    if (mem != nullptr && LSE::CanEliminateInstruction(inst))
205        eliminations_[inst] = block_heap[mem]; // Store the value to replace instruction with it
206        return;
207    }
208
209    // If this load MUST_ALIAS any inst from phis[block] it can be further replaced with value outside the loop
210    UpdatePhis(inst);
211
212    // Record loaded value to heap[block] and update all MUST_ALIASes
213    block_heap[inst] = {inst, inst};
214}
215```
216
217## Examples
218### Loading stored value in `if` block
219Before:
220```
221BB 2  preds: [bb 0]
222    7.i32  StoreArray                 v0, v2, v1
223    9.b    Compare GT i32             v1, v8 -> (v10)
224   10.     IfImm NE b                 v9, 0x0
225succs: [bb 3, bb 4]
226
227BB 4  preds: [bb 2]
228   15.i32  LoadArray                  v0, v2 -> (v21)
229   20.i32  LoadArray                  v0, v2 -> (v21)
230   21.i32  Add                        v15, v20 -> (v22p)
231succs: [bb 3]
232
233BB 3  preds: [bb 2, bb 4]
234  22p.i32  Phi                        v1(bb2), v21(bb4) -> (v23)
235   23.i32  Return                     v22p
236succs: [bb 1]
237```
238Here we can see that instruction `v15` and `v20` load a values stored by `v7`. As a result we can substitute these loads with stored value.
239
240After:
241```
242BB 2  preds: [bb 0]
243    7.i32  StoreArray                 v0, v2, v1
244    9.b    Compare GT i32             v1, v8 -> (v10)
245   10.     IfImm NE b                 v9, 0x0
246succs: [bb 3, bb 4]
247
248BB 4  preds: [bb 2]
249   21.i32  Add                        v1, v1 -> (v22p)
250succs: [bb 3]
251
252BB 3  preds: [bb 2, bb 4]
253  22p.i32  Phi                        v1(bb2), v21(bb4) -> (v23)
254   23.i32  Return                     v22p
255succs: [bb 1]
256```
257### Object access elimination in loop
258In this example we load array from object and sum its elements.
259```
260BB 2  preds: [bb 0]
261    3.ref  LoadObject 242             v0 -> (v6)
262    6.i32  LenArray                   v3 -> (v30, v16)
263   30.i32  Cmp                        v7, v6 -> (v31)
264   31.b    Compare GE i32             v30, v7 -> (v32)
265   32.     IfImm NE b                 v31, 0x0
266succs: [bb 3, bb 4]
267
268BB 4  preds: [bb 2, bb 4]
269prop: head, loop 1
270  10p.i32  Phi                        v7(bb2), v27(bb4) -> (v25, v27)
271  11p.i32  Phi                        v7(bb2), v26(bb4) -> (v26)
272   20.ref  LoadObject 242             v0 -> (v25)
273   25.i32  LoadArray                  v20, v10p -> (v26)
274   26.i32  Add                        v25, v11p -> (v11p, v35p)
275   27.i32  Add                        v10p, v28 -> (v10p, v16)
276   16.b    Compare GE i32             v27, v6 -> (v17)
277   17.     IfImm NE b                 v16, 0x0
278succs: [bb 3, bb 4]
279```
280Until the `LoadObject` accesses a volatile field, we can eliminate `v20` inside the loop and use `v3` instead.
281```
282BB 2  preds: [bb 0]
283    3.ref  LoadObject 242             v0 -> (v6, v25)
284    6.i32  LenArray                   v3 -> (v30, v16)
285   30.i32  Cmp                        v7, v6 -> (v31)
286   31.b    Compare GE i32             v30, v7 -> (v32)
287   32.     IfImm NE b                 v31, 0x0
288succs: [bb 3, bb 4]
289
290BB 4  preds: [bb 2, bb 4]
291prop: head, loop 1
292  10p.i32  Phi                        v7(bb2), v27(bb4) -> (v25, v27)
293  11p.i32  Phi                        v7(bb2), v26(bb4) -> (v26)
294   25.i32  LoadArray                  v3, v10p -> (v26)
295   26.i32  Add                        v25, v11p -> (v11p, v35p)
296   27.i32  Add                        v10p, v28 -> (v10p, v16)
297   16.b    Compare GE i32             v27, v6 -> (v17)
298   17.     IfImm NE b                 v16, 0x0
299succs: [bb 3, bb 4]
300```
301## Links
302
303Source code:
304[lse.cpp](../optimizer/optimizations/lse.cpp)
305[lse.h](../optimizer/optimizations/lse.h)
306
307Tests:
308[lse_test.cpp](../tests/lse_test.cpp)
309