1# Loop Unrolling 2## Overview 3 4`Loop unrolling` optimization increases loop body by copying instructions of the original loop body. 5 6## Rationality 7 8Increase number of instructions for each loop iteration, reduce branch penalty. 9 10## Dependence 11 12* Loop Analysis 13* Dominators Tree 14* Reverse Post Order (RPO) 15* Loop Peeling (to make loop with exit-point from backedge) 16 17 18## Algorithm 19 20`Loop unrolling` modifies loops with the following requirements: 21 22* loop is not irreducible; 23* loop-header is not OSR-entry; 24* there is only 1 back-edge; 25* loop-backedge is a single loop-exit point; 26* there are no inner loops; 27 28Optimization settings: 29 30**Instructions limit** - the maximum number of loop instructions after its unrolling; 31 32**Unroll factor** - the number of loop body copies including the original one; 33 34There two types of unrolling: with side-exits and without them. Unrolling without side-exits is applied for countable loops. 35 36### Countable loops 37 38Loop is countable if it contains compare between loop-index instruction and test-instruction defined outside loop. 39Loop-index should be incremented or decremented by a constant. Currently signed integer indexes are supported. 40 41``` 42[Loop-header] 43Phi(init, update) 44... 45 46[Loop-backedge] 47update(phi, constant) 48Compare(update, test) 49... 50where `update` is Add or Sub instruction 51``` 52### Unrolling without side-exits 53 54There are 3 stages of unrolling loop without side-exits: 55 561. Clone loop-body without loop-backedge `factor` times; 572. Fix loop-backedge compare by incrementing/decrementing test input with constant, counted using formula: `factor * loop_step`. If compare has `not-equal` condition code, replace it by `less-than`/`greater-than` 583. Clone loop-body with the original loop-backedge `factor` times, but replace edge to the loop-header with edge the loop-outer block; 59 60Here `factor` means number of cloned loop bodies. 61 62```cpp 63/---->[header] 64| | 65| v 66| [loop-body] 67| | 68| v 69\-----[backedge]----> ... 70``` 71``` 72 /---->[header] 73 | | 74 | v 75 | [loop-body] 76 | | 77 | v 78 | [loop-body'] 79 | | 80 | v 81 | [loop-body''] 82 | | 83 | v 84 \-----[backedge]----> ... 85 | 86 v 87 [loop-body] 88 | 89 v 90 [backedge]------\ 91 | | 92 v | 93 [loop-body] | 94 | | 95 v | 96 [outer-block]<---/ 97``` 98 99 100 101### Unrolling with side-exits 102 103For this case both loop-body and loop-backedge are cloned: 104```cpp 105 /---->[header] 106 | | 107 | v 108 | [loop-body] 109 | | 110 | v 111 | [backedge]------------\ << exit-block 112 | | | 113 | v | 114 | [loop-body-clone] | 115 | | | 116 | v | 117 \-----[backedge-clone]----->| << last-block 118 | 119 v 120 [outer]-----> ... 121``` 122## Pseudocode 123```cpp 124if (IsLoopCountable(loop)) { 125 auto clone_loop = CloneLoop(loop); 126 UnrollLoopBodyWithoutSideExits(loop); 127 FixCompareInst(loop); 128 UnrollLoopBodyWithSideExits(clone_loop); 129 RemoveEdgeToLoopHeader(clone_loop); 130} else { 131 UnrollLoopBodyWithSideExits(loop); 132} 133``` 134## Examples 135 136Countable loop unrolling: 137```cpp 138 auto graph = CreateEmptyGraph(); 139 GRAPH(graph) { 140 CONSTANT(0, stop); 141 CONSTANT(1, 0); // a = 0, b = 0 142 CONSTANT(2, step); 143 BASIC_BLOCK(2, 3, 4) { 144 INST(3, Opcode::Compare).b().SrcType(DataType::INT32).CC(CC_LT).Inputs(1, 0); 145 INST(4, Opcode::IfImm).SrcType(DataType::BOOL).CC(CC_NE).Imm(0).Inputs(3); // if a < stop 146 } 147 BASIC_BLOCK(3, 3, 4) { 148 INST(5, Opcode::Phi).s32().Inputs(1, 7); // a 149 INST(6, Opcode::Phi).s32().Inputs(1, 8); // b 150 INST(7, Opcode::Add).s32().Inputs(5, 2); // a += step 151 INST(8, Opcode::Add).s32().Inputs(6, 7); // b += a 152 INST(9, Opcode::Compare).b().SrcType(DataType::INT32).CC(CC_LT).Inputs(7, 0); 153 INST(10, Opcode::IfImm).SrcType(DataType::BOOL).CC(CC_NE).Imm(0).Inputs(9); // if a < stop 154 } 155 BASIC_BLOCK(4, -1) { 156 INST(11, Opcode::Phi).s32().Inputs(1, 6); 157 INST(12, Opcode::Return).s32().Inputs(11); // return b; 158 } 159 } 160 return graph; 161 162``` 163```cpp 164 uint32_t UNROLL_FACTOR = 2; 165 166 GRAPH(graph_unroll) { 167 CONSTANT(0, 10); 168 CONSTANT(1, 0); // a = 0, b = 0 169 CONSTANT(2, 1); 170 BASIC_BLOCK(2, 3, 5) { 171 INST(20, Opcode::SubI).s32().Inputs(0).Imm(UNROLL_FACTOR - 1); 172 INST(3, Opcode::Compare).b().SrcType(DataType::INT32).CC(CC_LT).Inputs(1, 20); // if (a < 10 - (UNROLL_FACTOR - 1)) 173 INST(4, Opcode::IfImm).SrcType(DataType::BOOL).CC(CC_NE).Imm(0).Inputs(3); 174 } 175 BASIC_BLOCK(3, 3, 5) { 176 INST(5, Opcode::Phi).s32().Inputs(1, 21); // a 177 INST(6, Opcode::Phi).s32().Inputs(1, 22); // b 178 INST(7, Opcode::Add).s32().Inputs(5, 2); // a + 1 179 INST(8, Opcode::Add).s32().Inputs(6, 7); // b + 1 180 INST(21, Opcode::Add).s32().Inputs(7, 2); // a + 1 181 INST(22, Opcode::Add).s32().Inputs(8, 21); // b + 1 182 INST(9, Opcode::Compare).b().SrcType(DataType::INT32).CC(CC_LT).Inputs(21, 20); 183 INST(10, Opcode::IfImm).SrcType(DataType::BOOL).CC(CC_NE).Imm(0).Inputs(9); // if (a < 10 - (UNROLL_FACTOR - 1)) 184 } 185 BASIC_BLOCK(5, 6, 4) { 186 INST(11, Opcode::Phi).s32().Inputs(1, 8); 187 INST(25, Opcode::Phi).s32().Inputs(1, 21); // a 188 INST(26, Opcode::Phi).s32().Inputs(1, 22); // b 189 INST(27, Opcode::Compare).b().SrcType(DataType::INT32).CC(CC_LT).Inputs(25, 0); // if (a < 10) 190 INST(28, Opcode::IfImm).SrcType(DataType::BOOL).CC(CC_NE).Imm(0).Inputs(27); 191 } 192 BASIC_BLOCK(6, 4) { 193 INST(29, Opcode::Add).s32().Inputs(25, 2); // a + 1 194 INST(30, Opcode::Add).s32().Inputs(26, 29); // b + 1 195 } 196 BASIC_BLOCK(4, -1) { 197 INST(31, Opcode::Phi).s32().Inputs(11, 26); 198 INST(12, Opcode::Return).s32().Inputs(31); // return b 199 } 200 } 201 202``` 203## Links 204Source code: 205 206[loop_unroll.cpp](../optimizer/optimizations/loop_unroll.cpp) 207 208[loop_unroll.h](../optimizer/optimizations/loop_unroll.h) 209 210Tests: 211 212[loop_unroll_test.cpp](../tests/loop_unroll_test.cpp)