• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# Memory Coalescing
2## Overview
3
4The optimization is based on the fact that some architectures (`AArch64` particularly) support simultaneous load and store operations for consecutive addresses instead of several separate operations.
5
6## Rationality
7
8Replacing two memory operations with one generally reduces the number of long latency memory instructions.
9
10| Code | Regular | Optimized |
11| ------ | ------ | ------|
12| `num[0] * num[1];` | `ldr     x1, [x0]` <br> `ldr     x0, [x0, 8]` <br> `mul     x0, x1, x0` | `ldp     x1, x0, [x0]` <br> `mul     x0, x1, x0` |
13
14## Dependence
15
16* DominatorsTree
17* LoopAnalyzer
18* AliasAnalysis
19* Reverse Post Order (RPO)
20
21## Assumptions
22
23The optimization was implemented for arrays' accesses for `AArch64` architecture.
24
25Array accesses cannot be volatile.
26
27`AArch64` has `32`-bit and `64`-bit versions of coalescing operations – `ldp` and `stp`. As a result it is possible only for following Panda's types: `INT32`, `UINT32`, `INT64`, `UINT64`, `REFERENCE`.
28
29## Algorithm
30
31To implement such kind of optimization the extra support from IR is required
32
33### IR Support
34
35The following actions are required
36
37* Separate instructions that will represent coalesced memory accesses from regular accesses.
38* Handle multiple output from one instruction in terms of SSA
39
40The case with a coalesced store is quite straightforward: having two consecutive stores we replace them by one instruction that accepts index and two values to store.
41
42| Consecutive Stores | Coalesced Store |
43| --- | --- |
44| `248.i64  StoreArrayI  v2, 0x0, v53` <br> `250.i64  StoreArrayI  v2, 0x1, v53` | `251.i64 StoreArrayPairI v2, 0x0, v53, v53` |
45
46The problem occurs with a coalesced load because a load instruction of multiple values produces multiple assignment that is not a part of SSA form. By this reason, we need additional pseudo instructions as `LoadPairPart` to divide multiple values into single values. The type of `LoadArrayPair` instruction corresponds to the type of a single element, so there is an assumption that `LoadArrayPair` loads only multiple values of the same type.
47
48| Consecutive Loads | Coalesced Load |
49| --- | --- |
50| `58.i64  LoadArrayI  v2, 0x0 -> (v37)` <br> `61.i64  LoadArrayI  v2, 0x1 -> (v43)`  | `62.i64  LoadArrayPairI v2, 0x0 -> (v63, v64)` <br> `63.i64  LoadPairPart  v62, 0x0 -> (v37)` <br> `64.i64  LoadPairPart  v62, 0x1 -> (v43)`  |
51
52### Transformation
53
54The optimization tries to coalesce memory operations in a scope of a basic block. It needs that two consecutive memory operations are placed near each other without intermediate instructions. By this reason we need to find a place to sunk an upper memory operation and to hoist the lower according to reordering rules.
55
56During hoisting and sinking of memory operations we use rules for memory instruction scheduling: do not move over monitors, calls, save states, save points and etc.
57
58Memory coalescing was implemented for array accesses. We process instructions of basic block in order. To find accesses of consecutive memory addresses we keep a queue of candidates. Each instruction that may be coalesced is inserted into this queue. A candidate is marked as invalid in the following conditions:
59 * it has been paired already
60 * store candidates are invalid if SaveState instruction has been met
61 * all candidates are invalid if a barrier is met: calls, control flow instructions, monitors, exceptions, intrinsic and etc.
62
63To track indices we use basic implementation of scalar evolution that allows to track how variables evolves: basic value (variable or constant), difference from the basic value (if basic value is a variable) and evolution (if basic value is a variable incremented on each iteration of a loop). It is a simple graph bypass by collecting assignments including Phi evolutions (supported only addition and subtraction).
64
65Processing each instruction in basic block we do the following:
66 1) If the instruction cannot be coalesced.
67     1) If the instruction is a barrier – invalidate all candidates.
68     2) If the instruction is a SaveState – invalidate all store candidates.
69     3) If the instruction is a memory operation – add it as invalid candidate.
70 2) If we can't determine anything about index variable, we add this instruction as a candidate and move on next instruction
71 3) Iterate candidates in backward order
72     2) If a candidate is invalid **or** candidate cannot be coalesced with the instruction **or** both refer to different objects **or** we have no information about candidate index, move on next candidate
73     3) If indices differs by one and there is a place to sunk the candidate instruction and hoist the currently processing instruction – add this candidate and the instruction as a pair for coalescing and invalidate both.
74 4) Add the instruction as a candidate.
75
76Finally, we replace collected pairs by coalesced instructions.
77
78To find a place for candidate and current instruction:
79 1) find the lowest position the candidate can be sunk
80 2) find the highest position the instruction can be hoisted
81 3) The place can be any between highest and lowest position. If the intersection is empty, coalescing is not possible.
82
83## Pseudocode
84
85```
86void MemoryCoalescing::RunImpl() {
87    VariableAnalysis variables(GetGraph());
88    for (auto block : GetGraph()->GetBlocksRPO()) {
89        for (auto inst : block.Insts()) {
90            if (IsArrayAccess(inst)) {
91                HandleArrayAccess(inst, variables);
92            } else if (inst->IsMemory()) {
93                inst->SetMarker(mrk_invalid_);
94                candidates.push_back(inst);
95            } else if (inst->IsBarrier()) {
96                // Remove all candidates -- do not move anything across barriers
97                candidates.clear();
98            }
99        }
100        // Work in scope of basic block
101        candidates.clear();
102    }
103
104    for (auto pair : pairs) {
105        // Replace a pair of instructions by a coalesced instruction
106        ReplacePair(pair);
107    }
108}
109
110void HandleArrayAccess(Inst *inst, VariableAnalysis &vars) {
111    Inst *obj = inst->GetObject();
112    Inst *idx = inst->GetIndex();
113    // If we don't know anything about index, do nothing
114    if (!vars.IsAnalyzed(idx)) {
115        candidates.push_back(inst);
116        return;
117    }
118    // Last candidates more likely to be coalesced
119    for (auto iter = candidates.rbegin(); iter != candidates.rend(); iter++) {
120        auto cand = *iter;
121        // Skip not interesting candidates: invalid and that cannot be coalesced with current inst
122        if (cand->IsMarked(invalid) || cand->GetOpcode() != inst->GetOpcode()) {
123            continue;
124        }
125
126        Inst *cand_obj = cand->GetObject();
127        auto cand_idx = cand->GetIndex();
128        // We need to have info about candidate's index and array objects must alias each other
129        if (!vars.IsAnalyzed(cand_idx) || obj.IsAlias(cand_obj) != MUST_ALIAS) {
130            continue;
131        }
132        // If both indices differs by one
133        Inst *position = FindBetterPlace(cand, inst);
134        if (poisition && vars.DiffersByConst(idx, cand_idx, 1)) {
135            pairs.push_back({cand, inst, position});
136            cand->SetMarker(invalid);
137            inst->SetMarket(invalid);
138        }
139    }
140
141    candidates.push_back(inst);
142}
143```
144
145## Examples
146
147### Loads and Stores with immediate indices
148Before optimization
149```
150BB 0
151prop: start
152    0.i64  Constant                   0x2a -> (v3)
153succs: [bb 2]
154
155BB 2  preds: [bb 0]
156    3.ref  NewArray 77                v0 -> (v42, v41)
157   41.     SaveState                  v3(vr7) -> (v42)
158   42.ref  NullCheck                  v3, v41 -> (v225, v229, v227, v230)
159  225.i64  LoadArrayI                 v42, 0x0 -> (v51)
160  227.i64  LoadArrayI                 v42, 0x1 -> (v51)
161   51.i64  Add                        v225, v227 -> (v229, v40, v230)
162  229.i64  StoreArrayI                v42, 0x0, v51
163  230.i64  StoreArrayI                v42, 0x1, v51
164   40.i64  Return                     v51
165succs: [bb 1]
166```
167After optimization
168```
169BB 0
170prop: start
171    0.i64  Constant                   0x2a -> (v3)
172succs: [bb 2]
173
174BB 2  preds: [bb 0]
175    3.ref  NewArray 77                v0 -> (v41, v42)
176   41.     SaveState                  v3(vr7) -> (v42)
177   42.ref  NullCheck                  v3, v41 -> (v231, v234)
178  231.i64  LoadArrayPairI             v42, 0x0 -> (v232, v233)
179  232.i64  LoadPairPart               v231, 0x0 -> (v51)
180  233.i64  LoadPairPart               v231, 0x1 -> (v51)
181   51.i64  Add                        v232, v233 -> (v234, v234, v40)
182  234.i64  StoreArrayPairI            v42, 0x0, v51, v51
183   40.i64  Return                     v51
184succs: [bb 1]
185```
186### Coalescing inside loop
187Before optimization
188```
189BB 2  preds: [bb 0]
190    3.i32  LenArray                   v0 -> (v35)
191succs: [bb 3]
192
193BB 3  preds: [bb 2, bb 3]
194prop: head, loop 1
195   6p.i32  Phi                        v4(bb2), v34(bb3) -> (v33, v17, v34)
196   7p.i32  Phi                        v5(bb2), v24(bb3) -> (v24, v17)
197   8p.i32  Phi                        v5(bb2), v25(bb3) -> (v25, v23, v24)
198   17.i32  StoreArray                 v0, v6p, v7p
199   33.i32  AddI                       v6p, 0x1 -> (v23)
200   23.i32  StoreArray                 v0, v33, v8p
201   24.i32  Add                        v7p, v8p -> (v7p, v25)
202   25.i32  Add                        v8p, v24 -> (v8p)
203   34.i32  AddI                       v6p, 0x2 -> (v6p, v35)
204   35.     If LT i32                  v34, v3
205succs: [bb 3, bb 4]
206
207BB 4  preds: [bb 3]
208   29.void ReturnVoid
209succs: [bb 1]
210```
211After optimization
212```
213BB 2  preds: [bb 0]
214    3.i32  LenArray                   v0 -> (v35)
215succs: [bb 3]
216
217BB 3  preds: [bb 2, bb 3]
218prop: head, loop 1
219   6p.i32  Phi                        v4(bb2), v34(bb3) -> (v33, v36, v34)
220   7p.i32  Phi                        v5(bb2), v24(bb3) -> (v36, v24)
221   8p.i32  Phi                        v5(bb2), v25(bb3) -> (v36, v24, v25)
222   33.i32  AddI                       v6p, 0x1
223   36.i32  StoreArrayPair             v0, v6p, v7p, v8p
224   24.i32  Add                        v7p, v8p -> (v7p, v25)
225   25.i32  Add                        v8p, v24 -> (v8p)
226   34.i32  AddI                       v6p, 0x2 -> (v6p, v35)
227   35.     If LT i32                  v34, v3
228succs: [bb 3, bb 4]
229
230BB 4  preds: [bb 3]
231   29.void ReturnVoid
232succs: [bb 1]
233```
234
235## Options
236
237| Option | Description | Default value |
238| --- | --- | --- |
239| `--compiler-memory-coalescing` | Enables optimization | `true` |
240| `--compiler-memory-coalescing-objects` | Allows coalescing of operations with `ref`s | `true` |
241| `--compiler-memory-coalescing-aligned` | Coalesces only aligned accesses (starting with even indices e.g. 0-1, 4-5 etc.) | `false` |
242
243## Links
244
245Source code:
246[memory_coalescing.cpp](../optimizer/optimizations/memory_coalescing.cpp)
247[memory_coalescing.h](../optimizer/optimizations/memory_coalescing.h)
248
249Tests:
250[memory_coalescing_test.cpp](../tests/memory_coalescing_test.cpp)