# Memory Coalescing
## Overview

The optimization is based on the fact that some architectures (`AArch64` in particular) provide load and store instructions that access two consecutive addresses at once, so two separate memory operations can be replaced with a single one.

## Rationality

Replacing two memory operations with one generally reduces the number of long-latency memory instructions to execute.

| Code | Regular | Optimized |
| ------ | ------ | ------ |
| `num[0] * num[1];` | `ldr x1, [x0]` <br> `ldr x0, [x0, 8]` <br> `mul x0, x1, x0` | `ldp x1, x0, [x0]` <br> `mul x0, x1, x0` |

## Dependence

* DominatorsTree
* LoopAnalyzer
* AliasAnalysis
* Reverse Post Order (RPO)

## Assumptions

The optimization is implemented for array accesses on the `AArch64` architecture.

Array accesses must not be volatile.

`AArch64` provides `32`-bit and `64`-bit versions of the coalescing instructions `ldp` and `stp`. As a result, coalescing is possible only for the following Panda types: `INT32`, `UINT32`, `INT64`, `UINT64`, `REFERENCE`.

## Algorithm

To implement this kind of optimization, extra support from the IR is required.

### IR Support

The following is required:

* Separate instructions that represent coalesced memory accesses, as opposed to regular accesses.
* Handling of multiple outputs from a single instruction in terms of SSA.

The case of a coalesced store is quite straightforward: having two consecutive stores, we replace them with one instruction that accepts an index and two values to store.

| Consecutive Stores | Coalesced Store |
| --- | --- |
| `248.i64 StoreArrayI v2, 0x0, v53` <br> `250.i64 StoreArrayI v2, 0x1, v53` | `251.i64 StoreArrayPairI v2, 0x0, v53, v53` |

The problem arises with a coalesced load, because a load instruction that produces multiple values is a multiple assignment, which is not part of SSA form. For this reason we need additional pseudo instructions such as `LoadPairPart` to split the multiple values into single ones. The type of a `LoadArrayPair` instruction corresponds to the type of a single element, so it is assumed that `LoadArrayPair` loads only values of the same type.

| Consecutive Loads | Coalesced Load |
| --- | --- |
| `58.i64 LoadArrayI v2, 0x0 -> (v37)` <br> `61.i64 LoadArrayI v2, 0x1 -> (v43)` | `62.i64 LoadArrayPairI v2, 0x0 -> (v63, v64)` <br> `63.i64 LoadPairPart v62, 0x0 -> (v37)` <br> `64.i64 LoadPairPart v62, 0x1 -> (v43)` |

### Transformation

The optimization tries to coalesce memory operations within the scope of a basic block. It requires the two memory operations being combined to be adjacent, with no intermediate instructions. For this reason we need to find a place where the upper memory operation can be sunk and the lower one hoisted, according to the reordering rules.

While hoisting and sinking memory operations we follow the usual rules of memory instruction scheduling: do not move them across monitors, calls, save states, save points, etc.

Memory coalescing is implemented for array accesses. We process the instructions of a basic block in order. To find accesses to consecutive memory addresses, we keep a queue of candidates: each instruction that may be coalesced is inserted into this queue. A candidate is marked as invalid under the following conditions (a sketch of this bookkeeping follows the list):
 * it has already been paired;
 * store candidates become invalid once a SaveState instruction is met;
 * all candidates become invalid once a barrier is met: calls, control flow instructions, monitors, exceptions, intrinsics, etc.
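To make the invalidation rules concrete, here is a minimal self-contained sketch of the queue bookkeeping. The `CandidateQueue` class and the toy `Inst` record are illustrative only; in the pass itself the candidates are IR instructions tracked with markers, as in the pseudocode below.

```
#include <vector>

// Toy instruction record: just enough state to show the invalidation rules
// (the real pass works on compiler IR instructions).
struct Inst {
    bool is_store = false;
    bool invalid = false;
};

class CandidateQueue {
public:
    void Add(Inst *inst) { candidates_.push_back(inst); }

    // A barrier (call, control flow instruction, monitor, exception,
    // intrinsic) forbids moving any memory access across it, so the whole
    // queue is dropped.
    void OnBarrier() { candidates_.clear(); }

    // A SaveState pins stores only: store candidates are marked invalid but
    // stay in the queue, while load candidates remain pairable.
    void OnSaveState() {
        for (auto *cand : candidates_) {
            if (cand->is_store) {
                cand->invalid = true;
            }
        }
    }

private:
    std::vector<Inst *> candidates_;
};
```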
To track indices, we use a basic implementation of scalar evolution that records how a variable evolves: its basic value (a variable or a constant), its difference from the basic value (if the basic value is a variable), and its evolution (if the basic value is a variable incremented on each iteration of a loop). It is a simple graph traversal that collects assignments, including Phi evolutions (only addition and subtraction are supported).

Processing each instruction of a basic block, we do the following:
 1) If the instruction cannot be coalesced:
    1) if the instruction is a barrier, invalidate all candidates;
    2) if the instruction is a SaveState, invalidate all store candidates;
    3) if the instruction is a memory operation, add it as an invalid candidate.
 2) If we cannot determine anything about the index variable, add the instruction as a candidate and move on to the next instruction.
 3) Otherwise, iterate over the candidates in backward order:
    1) if the candidate is invalid, **or** it cannot be coalesced with the instruction, **or** the two refer to different objects, **or** we have no information about the candidate's index, move on to the next candidate;
    2) if the indices differ by one and there is a place where the candidate can be sunk and the currently processed instruction hoisted, record the candidate and the instruction as a pair for coalescing and invalidate both.
 4) Add the instruction as a candidate.

Finally, we replace the collected pairs with coalesced instructions.

To find a place for the candidate and the current instruction:
 1) find the lowest position the candidate can be sunk to;
 2) find the highest position the instruction can be hoisted to;
 3) the place can be anywhere in the overlap of these two ranges; if the overlap is empty, coalescing is not possible.

## Pseudocode

```
void MemoryCoalescing::RunImpl() {
    VariableAnalysis variables(GetGraph());
    for (auto block : GetGraph()->GetBlocksRPO()) {
        for (auto inst : block->Insts()) {
            if (IsArrayAccess(inst)) {
                HandleArrayAccess(inst, variables);
            } else if (inst->IsMemory()) {
                // An unsupported memory operation stays in the queue as an
                // invalid candidate
                inst->SetMarker(mrk_invalid_);
                candidates.push_back(inst);
            } else if (inst->IsBarrier()) {
                // Remove all candidates -- do not move anything across barriers
                candidates.clear();
            }
        }
        // Work in the scope of a basic block
        candidates.clear();
    }

    for (auto pair : pairs) {
        // Replace a pair of instructions with a coalesced instruction
        ReplacePair(pair);
    }
}

void MemoryCoalescing::HandleArrayAccess(Inst *inst, VariableAnalysis &vars) {
    Inst *obj = inst->GetObject();
    Inst *idx = inst->GetIndex();
    // If we don't know anything about the index, keep the access only as a
    // candidate for the following instructions
    if (!vars.IsAnalyzed(idx)) {
        candidates.push_back(inst);
        return;
    }
    // The later a candidate was added, the more likely it can be coalesced
    for (auto iter = candidates.rbegin(); iter != candidates.rend(); iter++) {
        auto cand = *iter;
        // Skip uninteresting candidates: invalid ones and those that cannot
        // be coalesced with the current instruction
        if (cand->IsMarked(mrk_invalid_) || cand->GetOpcode() != inst->GetOpcode()) {
            continue;
        }

        Inst *cand_obj = cand->GetObject();
        Inst *cand_idx = cand->GetIndex();
        // We need info about the candidate's index, and the array objects
        // must be the same
        if (!vars.IsAnalyzed(cand_idx) || obj->IsAlias(cand_obj) != MUST_ALIAS) {
            continue;
        }
        // Pair up if the indices differ by one and there is a position where
        // both accesses can legally meet
        Inst *position = FindBetterPlace(cand, inst);
        if (position != nullptr && vars.DiffersByConst(idx, cand_idx, 1)) {
            pairs.push_back({cand, inst, position});
            cand->SetMarker(mrk_invalid_);
            inst->SetMarker(mrk_invalid_);
            // The instruction is paired, stop searching for a candidate
            break;
        }
    }

    candidates.push_back(inst);
}
```
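The pseudocode relies on `VariableAnalysis` for the `IsAnalyzed` and `DiffersByConst` queries. The snippet below is a minimal self-contained sketch of the underlying idea, describing an index as a base variable plus a constant offset; the `IndexDesc` type and the string-based base are illustrative, not the actual implementation.

```
#include <cstdint>
#include <string>

// Toy index description: a named base variable plus a constant offset,
// e.g. {"i", 3} stands for `i + 3`. The real analysis derives this from
// chains of Add/Sub instructions and Phi evolutions.
struct IndexDesc {
    std::string base;
    int64_t diff;
};

// Two indices differ by `expected` iff they share the same base variable
// and their constant offsets differ by exactly `expected`.
bool DiffersByConst(const IndexDesc &lhs, const IndexDesc &rhs, int64_t expected) {
    return lhs.base == rhs.base && lhs.diff - rhs.diff == expected;
}

int main() {
    IndexDesc i_plus_1{"i", 1};  // index of a[i + 1]
    IndexDesc i{"i", 0};         // index of a[i]
    // The pass asks whether the current index exceeds the candidate's index
    // by one, as in `vars.DiffersByConst(idx, cand_idx, 1)` above.
    return DiffersByConst(i_plus_1, i, 1) ? 0 : 1;
}
```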
## Examples

### Loads and Stores with immediate indices
Before optimization:
```
BB 0
prop: start
    0.i64 Constant 0x2a -> (v3)
succs: [bb 2]

BB 2 preds: [bb 0]
    3.ref NewArray 77 v0 -> (v42, v41)
    41. SaveState v3(vr7) -> (v42)
    42.ref NullCheck v3, v41 -> (v225, v229, v227, v230)
    225.i64 LoadArrayI v42, 0x0 -> (v51)
    227.i64 LoadArrayI v42, 0x1 -> (v51)
    51.i64 Add v225, v227 -> (v229, v40, v230)
    229.i64 StoreArrayI v42, 0x0, v51
    230.i64 StoreArrayI v42, 0x1, v51
    40.i64 Return v51
succs: [bb 1]
```
After optimization:
```
BB 0
prop: start
    0.i64 Constant 0x2a -> (v3)
succs: [bb 2]

BB 2 preds: [bb 0]
    3.ref NewArray 77 v0 -> (v41, v42)
    41. SaveState v3(vr7) -> (v42)
    42.ref NullCheck v3, v41 -> (v231, v234)
    231.i64 LoadArrayPairI v42, 0x0 -> (v232, v233)
    232.i64 LoadPairPart v231, 0x0 -> (v51)
    233.i64 LoadPairPart v231, 0x1 -> (v51)
    51.i64 Add v232, v233 -> (v234, v234, v40)
    234.i64 StoreArrayPairI v42, 0x0, v51, v51
    40.i64 Return v51
succs: [bb 1]
```
### Coalescing inside a loop
Before optimization:
```
BB 2 preds: [bb 0]
    3.i32 LenArray v0 -> (v35)
succs: [bb 3]

BB 3 preds: [bb 2, bb 3]
prop: head, loop 1
    6p.i32 Phi v4(bb2), v34(bb3) -> (v33, v17, v34)
    7p.i32 Phi v5(bb2), v24(bb3) -> (v24, v17)
    8p.i32 Phi v5(bb2), v25(bb3) -> (v25, v23, v24)
    17.i32 StoreArray v0, v6p, v7p
    33.i32 AddI v6p, 0x1 -> (v23)
    23.i32 StoreArray v0, v33, v8p
    24.i32 Add v7p, v8p -> (v7p, v25)
    25.i32 Add v8p, v24 -> (v8p)
    34.i32 AddI v6p, 0x2 -> (v6p, v35)
    35. If LT i32 v34, v3
succs: [bb 3, bb 4]

BB 4 preds: [bb 3]
    29.void ReturnVoid
succs: [bb 1]
```
After optimization:
```
BB 2 preds: [bb 0]
    3.i32 LenArray v0 -> (v35)
succs: [bb 3]

BB 3 preds: [bb 2, bb 3]
prop: head, loop 1
    6p.i32 Phi v4(bb2), v34(bb3) -> (v33, v36, v34)
    7p.i32 Phi v5(bb2), v24(bb3) -> (v36, v24)
    8p.i32 Phi v5(bb2), v25(bb3) -> (v36, v24, v25)
    33.i32 AddI v6p, 0x1
    36.i32 StoreArrayPair v0, v6p, v7p, v8p
    24.i32 Add v7p, v8p -> (v7p, v25)
    25.i32 Add v8p, v24 -> (v8p)
    34.i32 AddI v6p, 0x2 -> (v6p, v35)
    35. If LT i32 v34, v3
succs: [bb 3, bb 4]

BB 4 preds: [bb 3]
    29.void ReturnVoid
succs: [bb 1]
```

## Options

| Option | Description | Default value |
| --- | --- | --- |
| `--compiler-memory-coalescing` | Enables the optimization | `true` |
| `--compiler-memory-coalescing-objects` | Allows coalescing of operations with `ref`s | `true` |
| `--compiler-memory-coalescing-aligned` | Coalesces only aligned accesses (pairs starting at even indices, e.g. 0-1, 4-5) | `false` |

## Links

Source code:
[memory_coalescing.cpp](../optimizer/optimizations/memory_coalescing.cpp)
[memory_coalescing.h](../optimizer/optimizations/memory_coalescing.h)

Tests:
[memory_coalescing_test.cpp](../tests/memory_coalescing_test.cpp)
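A closing note on `--compiler-memory-coalescing-aligned`: per the description above, the option restricts pairing to accesses whose lower index is even. A hypothetical predicate expressing this check (the name is illustrative, not the actual implementation):

```
#include <cstdint>

// A pair of accesses at indices (n, n + 1) is accepted only when n is even
// (pairs 0-1, 2-3, 4-5, ...), matching the option's description above.
bool IsAlignedPair(uint64_t lower_index) {
    return lower_index % 2 == 0;
}
```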