# Memory Coalescing
## Overview
The optimization is based on the fact that some architectures (`AArch64` in particular) provide instructions that load or store a pair of values at consecutive addresses in one operation instead of several separate ones.
## Rationale
Replacing two memory operations with one generally reduces the number of long-latency memory instructions.
| Code | Regular | Optimized |
| ------ | ------ | ------|
| `num[0] * num[1];` | `ldr x1, [x0]`<br>`ldr x0, [x0, 8]`<br>`mul x0, x1, x0` | `ldp x1, x0, [x0]`<br>`mul x0, x1, x0` |
## Dependence
* DominatorsTree
* LoopAnalyzer
* AliasAnalysis
* Reverse Post Order (RPO)
## Assumptions
The optimization is implemented for array accesses on the `AArch64` architecture.
Array accesses must not be volatile.
`AArch64` provides `32`-bit and `64`-bit versions of the coalescing operations `ldp` and `stp`. As a result, coalescing is possible only for the following Panda types: `INT32`, `UINT32`, `INT64`, `UINT64`, `REFERENCE`.
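For illustration, a type filter implied by these constraints might look as follows (a minimal sketch; the helper name is hypothetical and only the enumerators listed above are assumed):
```
// Minimal sketch: only the listed Panda types fit the 32-bit and 64-bit
// forms of ldp/stp, so only they are considered for coalescing.
// The helper name is hypothetical.
static bool TypeIsCoalescable(DataType::Type type) {
    switch (type) {
        case DataType::INT32:
        case DataType::UINT32:
        case DataType::INT64:
        case DataType::UINT64:
        case DataType::REFERENCE:
            return true;
        default:
            return false;
    }
}
```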
## Algorithm
Implementing this kind of optimization requires extra support from the IR.
### IR Support
The following actions are required:
* Introduce separate instructions that represent coalesced memory accesses, distinct from regular accesses.
* Handle multiple outputs from one instruction in terms of SSA form.

The case of a coalesced store is quite straightforward: having two consecutive stores, we replace them with one instruction that accepts an index and two values to store.
| Consecutive Stores | Coalesced Store |
| --- | --- |
| `248.i64 StoreArrayI v2, 0x0, v53`<br>`250.i64 StoreArrayI v2, 0x1, v53` | `251.i64 StoreArrayPairI v2, 0x0, v53, v53` |
The problem occurs with a coalesced load because a load instruction producing multiple values is a multiple assignment, which is not a part of SSA form. For this reason we need an additional pseudo instruction, `LoadPairPart`, to split the multiple values into single ones. The type of a `LoadArrayPair` instruction corresponds to the type of a single element, so there is an assumption that `LoadArrayPair` loads only multiple values of the same type.
| Consecutive Loads | Coalesced Load |
| --- | --- |
| `58.i64 LoadArrayI v2, 0x0 -> (v37)`<br>`61.i64 LoadArrayI v2, 0x1 -> (v43)` | `62.i64 LoadArrayPairI v2, 0x0 -> (v63, v64)`<br>`63.i64 LoadPairPart v62, 0x0 -> (v37)`<br>`64.i64 LoadPairPart v62, 0x1 -> (v43)` |
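As a sketch, the rewrite of two consecutive loads into the coalesced form could look as follows (all helper names here are hypothetical; the actual IR builder calls may differ):
```
// Hypothetical sketch: replace two consecutive loads with a LoadArrayPair
// followed by two LoadPairPart pseudo instructions that keep the IR in SSA form.
void CoalesceLoads(Inst *first, Inst *second, Inst *place) {
    // One instruction loads both elements starting from the lower index
    Inst *pair = CreateLoadArrayPair(first->GetObject(), first->GetIndex());
    // Pseudo instructions split the multiple result into single SSA values
    Inst *part0 = CreateLoadPairPart(pair, 0);
    Inst *part1 = CreateLoadPairPart(pair, 1);
    InsertBefore(place, pair, part0, part1);
    // Users of the original loads now consume the corresponding parts
    first->ReplaceUsers(part0);
    second->ReplaceUsers(part1);
    RemoveInst(first);
    RemoveInst(second);
}
```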
### Transformation
The optimization tries to coalesce memory operations within the scope of a basic block. It requires that two memory operations on consecutive addresses are placed next to each other without intermediate instructions. For this reason we need to find a place to sink the upper memory operation and to hoist the lower one according to reordering rules.
During hoisting and sinking of memory operations we use the rules for memory instruction scheduling: do not move across monitors, calls, save states, safe points, etc.
Memory coalescing is implemented for array accesses. We process the instructions of a basic block in order. To find accesses of consecutive memory addresses we keep a queue of candidates. Each instruction that may be coalesced is inserted into this queue. A candidate is marked as invalid under the following conditions:
* it has already been paired
* store candidates become invalid when a SaveState instruction is met
* all candidates become invalid when a barrier is met: calls, control-flow instructions, monitors, exceptions, intrinsics, etc. (a sketch of the barrier check follows this list)
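A minimal sketch of that barrier check, assuming illustrative predicate names (the exact set of barriers is defined by the pass itself):
```
// Sketch with illustrative predicates: an instruction is a barrier for
// coalescing if memory accesses must not be reordered across it.
static bool IsBarrierForCoalescing(Inst *inst) {
    return inst->IsCall() || inst->IsControlFlow() || inst->IsMonitor() ||
           inst->CanThrow() || inst->IsIntrinsic();
}
```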
To track indices we use a basic implementation of scalar evolution that tracks how variables evolve: a basic value (variable or constant), a difference from the basic value (if the basic value is a variable), and an evolution (if the basic value is a variable incremented on each iteration of a loop). It is a simple graph traversal that collects assignments, including Phi evolutions (only addition and subtraction are supported).
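For illustration, the tracked information and the adjacency check can be modeled as follows (a sketch with hypothetical names; the real analysis works on IR instructions):
```
// Sketch: an analyzed index is described by a base value, a constant offset
// from that base, and a per-iteration step if the base is a loop Phi.
struct Variable {
    Inst *base;    // variable or constant the index is derived from
    int64_t diff;  // constant difference from the base value
    int64_t step;  // loop evolution: increment per iteration (0 if none)
};

// Two indices address adjacent elements if they have the same base and the
// same evolution, and their constant offsets differ exactly by `expected`.
static bool DiffersByConst(const Variable &a, const Variable &b, int64_t expected) {
    return a.base == b.base && a.step == b.step && (a.diff - b.diff) == expected;
}
```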
Processing each instruction of a basic block, we do the following:
1) If the instruction cannot be coalesced:
   1) If the instruction is a barrier – invalidate all candidates.
   2) If the instruction is a SaveState – invalidate all store candidates.
   3) If the instruction is a memory operation – add it as an invalid candidate.
2) If we cannot determine anything about the index variable, add the instruction as a candidate and move on to the next instruction.
3) Iterate over the candidates in backward order:
   1) If a candidate is invalid **or** the candidate cannot be coalesced with the instruction **or** they refer to different objects **or** we have no information about the candidate's index – move on to the next candidate.
   2) If the indices differ by one and there is a place to sink the candidate and hoist the currently processed instruction – record this candidate and the instruction as a pair for coalescing and invalidate both.
4) Add the instruction as a candidate.

Finally, we replace the collected pairs with coalesced instructions.
To find a place for the candidate and the current instruction (sketched below), we:
1) find the lowest position the candidate can be sunk to;
2) find the highest position the instruction can be hoisted to;
3) pick any place between the highest and the lowest position. If the intersection is empty, coalescing is not possible.
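A sketch of that search, with hypothetical helpers that are assumed to already respect the reordering rules above:
```
// Sketch: `cand` is the earlier access, `inst` the later one. The candidate
// may be sunk down to `lowest`, the instruction may be hoisted up to
// `highest`; any point in between can hold the coalesced access.
Inst *FindBetterPlace(Inst *cand, Inst *inst) {
    Inst *lowest = LowestSinkTarget(cand);    // furthest point `cand` may move down to
    Inst *highest = HighestHoistTarget(inst); // furthest point `inst` may move up to
    // If `lowest` is still above `highest`, the ranges do not intersect and
    // coalescing is not possible here.
    if (IsBefore(lowest, highest)) {
        return nullptr;
    }
    return highest;  // any position between `highest` and `lowest` would do
}
```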
## Pseudocode
```
void MemoryCoalescing::RunImpl() {
    VariableAnalysis variables(GetGraph());
    for (auto block : GetGraph()->GetBlocksRPO()) {
        for (auto inst : block->Insts()) {
            if (IsArrayAccess(inst)) {
                HandleArrayAccess(inst, variables);
            } else if (inst->IsMemory()) {
                // Unsupported memory access: keep it in the queue so that
                // nothing is moved across it, but never pair it
                inst->SetMarker(mrk_invalid_);
                candidates.push_back(inst);
            } else if (inst->IsBarrier()) {
                // Remove all candidates -- do not move anything across barriers
                candidates.clear();
            }
        }
        // Work in the scope of a basic block
        candidates.clear();
    }
    for (auto pair : pairs) {
        // Replace a pair of instructions with a coalesced instruction
        ReplacePair(pair);
    }
}

void MemoryCoalescing::HandleArrayAccess(Inst *inst, VariableAnalysis &vars) {
    Inst *obj = inst->GetObject();
    Inst *idx = inst->GetIndex();
    // If we don't know anything about the index, just remember the access
    if (!vars.IsAnalyzed(idx)) {
        candidates.push_back(inst);
        return;
    }
    // The last candidates are more likely to be coalesced
    for (auto iter = candidates.rbegin(); iter != candidates.rend(); ++iter) {
        auto cand = *iter;
        // Skip uninteresting candidates: invalid ones and those that cannot be
        // coalesced with the current instruction
        if (cand->IsMarked(mrk_invalid_) || cand->GetOpcode() != inst->GetOpcode()) {
            continue;
        }
        Inst *cand_obj = cand->GetObject();
        auto cand_idx = cand->GetIndex();
        // We need info about the candidate's index, and the array objects must alias each other
        if (!vars.IsAnalyzed(cand_idx) || obj->IsAlias(cand_obj) != MUST_ALIAS) {
            continue;
        }
        // Pair the accesses if the indices differ by one and there is a place
        // where the coalesced access can be inserted
        Inst *position = FindBetterPlace(cand, inst);
        if (position != nullptr && vars.DiffersByConst(idx, cand_idx, 1)) {
            pairs.push_back({cand, inst, position});
            cand->SetMarker(mrk_invalid_);
            inst->SetMarker(mrk_invalid_);
            break;
        }
    }
    candidates.push_back(inst);
}
```
## Examples
### Loads and Stores with immediate indices
Before optimization
```
BB 0
prop: start
0.i64 Constant 0x2a -> (v3)
succs: [bb 2]
BB 2 preds: [bb 0]
3.ref NewArray 77 v0 -> (v42, v41)
41. SaveState v3(vr7) -> (v42)
42.ref NullCheck v3, v41 -> (v225, v229, v227, v230)
225.i64 LoadArrayI v42, 0x0 -> (v51)
227.i64 LoadArrayI v42, 0x1 -> (v51)
51.i64 Add v225, v227 -> (v229, v40, v230)
229.i64 StoreArrayI v42, 0x0, v51
230.i64 StoreArrayI v42, 0x1, v51
40.i64 Return v51
succs: [bb 1]
```
After optimization
```
BB 0
prop: start
0.i64 Constant 0x2a -> (v3)
succs: [bb 2]
BB 2 preds: [bb 0]
3.ref NewArray 77 v0 -> (v41, v42)
41. SaveState v3(vr7) -> (v42)
42.ref NullCheck v3, v41 -> (v231, v234)
231.i64 LoadArrayPairI v42, 0x0 -> (v232, v233)
232.i64 LoadPairPart v231, 0x0 -> (v51)
233.i64 LoadPairPart v231, 0x1 -> (v51)
51.i64 Add v232, v233 -> (v234, v234, v40)
234.i64 StoreArrayPairI v42, 0x0, v51, v51
40.i64 Return v51
succs: [bb 1]
```
### Coalescing inside loop
Before optimization
```
BB 2 preds: [bb 0]
3.i32 LenArray v0 -> (v35)
succs: [bb 3]
BB 3 preds: [bb 2, bb 3]
prop: head, loop 1
6p.i32 Phi v4(bb2), v34(bb3) -> (v33, v17, v34)
7p.i32 Phi v5(bb2), v24(bb3) -> (v24, v17)
8p.i32 Phi v5(bb2), v25(bb3) -> (v25, v23, v24)
17.i32 StoreArray v0, v6p, v7p
33.i32 AddI v6p, 0x1 -> (v23)
23.i32 StoreArray v0, v33, v8p
24.i32 Add v7p, v8p -> (v7p, v25)
25.i32 Add v8p, v24 -> (v8p)
34.i32 AddI v6p, 0x2 -> (v6p, v35)
35. If LT i32 v34, v3
succs: [bb 3, bb 4]
BB 4 preds: [bb 3]
29.void ReturnVoid
succs: [bb 1]
```
After optimization
```
BB 2 preds: [bb 0]
3.i32 LenArray v0 -> (v35)
succs: [bb 3]
BB 3 preds: [bb 2, bb 3]
prop: head, loop 1
6p.i32 Phi v4(bb2), v34(bb3) -> (v33, v36, v34)
7p.i32 Phi v5(bb2), v24(bb3) -> (v36, v24)
8p.i32 Phi v5(bb2), v25(bb3) -> (v36, v24, v25)
33.i32 AddI v6p, 0x1
36.i32 StoreArrayPair v0, v6p, v7p, v8p
24.i32 Add v7p, v8p -> (v7p, v25)
25.i32 Add v8p, v24 -> (v8p)
34.i32 AddI v6p, 0x2 -> (v6p, v35)
35. If LT i32 v34, v3
succs: [bb 3, bb 4]
BB 4 preds: [bb 3]
29.void ReturnVoid
succs: [bb 1]
```
## Options
| Option | Description | Default value |
| --- | --- | --- |
| `--compiler-memory-coalescing` | Enables optimization | `true` |
| `--compiler-memory-coalescing-objects` | Allows coalescing of operations with `ref`s | `true` |
| `--compiler-memory-coalescing-aligned` | Coalesces only aligned accesses (starting with even indices e.g. 0-1, 4-5 etc.) | `false` |
## Links
Source code:
[memory_coalescing.cpp](../optimizer/optimizations/memory_coalescing.cpp)
[memory_coalescing.h](../optimizer/optimizations/memory_coalescing.h)
Tests:
[memory_coalescing_test.cpp](../tests/memory_coalescing_test.cpp)