# Panda Intermediate representation (IR) design

This document describes the Panda IR design with the following goals:
* Possibility to implement various optimizations and analyses
* Support all the features and instructions of Panda bytecode
* Focus on the ARM64 architecture
* Compiler overhead of about 100000 native instructions per bytecode instruction (standard for JIT compilers)
* Be able to convert to other IRs and back

## Optimizations and analyses

In the development process, it is very important to have auxiliary functionality for various code transformations and analyses. The structure of the IR should be as clear as possible and should make it possible to implement various algorithms. The Panda IR should contribute to this.
The order in which optimizations and analyses are executed is also very important in the compilation process. First, there are dependencies between different passes. Second, one optimization often creates a context for others.
The first goal of the Panda IR is to be able to change the order of the passes and to add or delete passes (if two passes have a dependency, we must take this into account). We should be able to change the order of the passes via options.
Second, we need to support the transfer of information between optimizations.

### List of the optimizations

* [IrBuilder](../compiler/docs/ir_builder.md)
* [BranchElimination](../compiler/docs/branch_elimination_doc.md)
* [ChecksElimination](../compiler/docs/check_elimination_doc.md)
* [Cleanup](../compiler/docs/cleanup_doc.md)
* [Codegen](../compiler/docs/codegen_doc.md)
* [CodeSink](../compiler/docs/code_sink_doc.md)
* [Constant Folding](../compiler/docs/constant_folding_doc.md)
* [IfConversion](../compiler/docs/if_conversion_doc.md)
* [Inlining](../compiler/docs/inlining.md)
* [LICM](../compiler/docs/licm_doc.md)
* [Loop peeling](../compiler/docs/loop_peeling.md)
* [Loop unrolling](../compiler/docs/loop_unrolling.md)
* [Lowering](../compiler/docs/lowering_doc.md)
* [Load Store Elimination (LSE)](../compiler/docs/lse_doc.md)
* [Memory Coalescing](../compiler/docs/memory_coalescing_doc.md)
* [Peepholes](../compiler/docs/peephole_doc.md)
* [Redundant Loop Elimination](../compiler/docs/redundant_loop_elimination_doc.md)
* [Scheduler](../compiler/docs/scheduler_doc.md)
* [Value Numbering](../compiler/docs/vn_doc.md)

### Analyses

* Alias Analysis
* Bounds Analysis
* Domtree
* Linear Order
* Liveness Analysis
* Monitor Analysis
* Reverse Post Order (RPO)

### Potential optimizations

The benefits of some optimizations are not obvious, or they require profiling information to be implemented. We will keep them in mind, but will implement them only after analyzing the performance of the code.

* Remove cold path
* MAW (Memory access widening)/Merge memory
* [Block duplication](https://en.wikipedia.org/wiki/Loop_unswitching)

!NOTE It is possible to write other optimizations based on the specifics of the language and VM.

### The order of optimizations

We will try to make it possible to run optimizations in an arbitrary order. Some restrictions will remain: register allocation and code generation at the end, inlining at the beginning. Some optimizations (DCE, Peephole) will be called several times.

## Features

* Using profile information for IFC and speculative optimizations
* Supporting side exits for de-optimizations and removing cold code
* Converting to LLVM IR
* Independence from the Runtime (all profile and runtime information will be contained in a special class with default values)
* Common properties will be introduced for the instructions, making it easier to add new instructions

## Instruction set

Panda IR needs to combine the properties of high-level and low-level IRs.

High level:

Panda bytecode has more than 200 instructions. We need to convert all bytecode instructions into IR instructions with minimal overhead (ideally one to one).
The specifics and properties of instructions should be taken into account in optimizations and codegen.

Low level:

The main target is ARM64, so the Panda IR should be able to do ARM-specific optimizations. For this, it needs to support the ARMv8 instruction set (only those instructions that are needed).

Proposal:

The IR contains high-level and low-level instructions with a single interface.
In the first step, Panda bytecode is converted into high-level instructions and architecture-independent optimizations are performed.
In the second step, the instructions are split into several low-level instructions (close to assembler instructions) for additional optimizations.

## Overhead

Overhead is the time required for compilation.
Typically, overhead is measured as the average number of 'native' instructions (ARM) spent compiling a single 'guest' instruction (from bytecode).
The more optimizations we do, and the more complex they are, the more overhead we get. We need to find a balance between performance and the overhead needed to achieve it. For example, the [Unroll](https://en.wikipedia.org/wiki/Loop_unrolling) optimization allows removing unnecessary loop induction variables and dependencies between loop iterations, but it increases the size of the code, which in turn increases the overhead. We should apply this optimization only if the benefit from it exceeds the increase in overhead costs.

In Ahead-Of-Time (AOT) mode the overhead is less critical for us, so we can do more optimizations.
In Just-In-Time (JIT) mode we need to strictly control the overhead to get an overall performance increase (compile time + execution time).

The goal is an overhead of about 100000 native instructions per guest instruction (standard for JIT compilers).
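
For a rough sense of scale (illustrative arithmetic only, not a measurement): with such a budget, compiling a method of 1000 bytecode instructions costs about 1000 * 100000 = 10^8 native instructions, which is on the order of tens of milliseconds on a core that executes a few billion instructions per second.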

## Compatibility

To be able to integrate into existing compilers, as well as to compare efficiency, we need the ability to convert to Panda IR and back.
A converter between Panda IR and LLVM IR will allow using various LLVM optimizations.

## IR structure

### Rationale

The most widely used IR forms in compilers are the classical CFG (Control Flow Graph) in SSA (Static Single Assignment) form (used in LLVM, WebKit, HHVM, CoreCLR, IonMonkey) and Sea-of-Nodes (Hotspot, V8 Turbofan).
We decided to choose the CFG with SSA form for the following reasons:
1. It is more common in compilers and easier to understand
2. Sea-of-Nodes has a big overhead in the IR construction and scheduling phases, which makes it impossible to build a lightweight tier 1 (applying a small number of optimizations with minimal overhead for fast code generation)

### Graph

The main class is the **Graph**. It contains all the information needed by the compiler, such as:
 * Information about the method for which transformations are made
 * Pointer to the RuntimeInterface, a class with all Runtime information
 * Vector of pointers to **BasicBlocks**
 * Information about the current status (whether RPO, DomTree, etc. are constructed)
 * Information to be transmitted between passes
 * Pass manager

Class **Graph** allows creating new instructions, adding and removing blocks, constructing the RPO, DomTree, etc.

### BasicBlock

**BasicBlock** is a class that describes a linear part of executable code. A BasicBlock contains:
 * A doubly linked list of the instructions contained in the block
 * List of predecessors: a vector of pointers to the BasicBlocks from which we can get into the current block
 * List of successors: a vector of pointers to the BasicBlocks that we can get to from the current block
 * Information about the DomTree

Class **BasicBlock** allows adding and removing instructions in the BasicBlock, adding and removing successors and predecessors, getting the dominator block and dominated blocks, etc.
The Graph always begins with the **start** BasicBlock and finishes with the **end** BasicBlock.
The **start** BasicBlock doesn't have predecessors and has one successor. Only SafePoint, Constant and Parameter instructions can be contained in the start BasicBlock.
The **end** BasicBlock doesn't have successors and doesn't contain instructions.

A **BasicBlock** cannot have more than one incoming or outgoing edge to the same block.
When the control flow looks like the example below, we must keep an empty block on the edge. The left graph cannot be transformed into the right one even when block 3 has no instructions.

```
      [1]           [1]
      |  \          |  \
      |  [3]  -x->  |   |
      |  /          |  /
      [2]           [2]

```

The empty blocks pass covers this situation and does not remove such an empty block when there are `Phi` instructions in block 2 with different inputs from those incoming edges. When there are no such `Phi`s, we can easily remove the second edge too.

Another solution may be to introduce `Select` instructions at an early stage. A third solution is to keep special `Mov` instructions in block 3, but this contradicts the ideas of SSA form, and experiments show that it is less effective, as we keep too many `Mov`-only blocks.

| Bench | Empty Blocks | Mov-Blocks |
| ------ | ------ | ------ |
| access-fannkuch-c2p | 0 | 1 |
| math-spectral-norm-c2p | 0 | 1 |
| bitops-bitwise-and-c2p | 0 | 0 |
| bitops-bits-in-byte-c2p | 0 | 1 |
| bitops-3bit-bits-in-byte-c2p | 0 | 1 |
| access-nsieve-c2p | 0 | 1 |
| controlflow-recursive-c2p | 0 | 25 |
| 3d-morph-c2p | 0 | 3 |
| math-partial-sums | 1 | 1 |
| controlflow-recursive | 1 | 86 |
| bitops-nsieve-bits | 1 | 2 |
| access-binary-trees | 3 | 4 |
| access-nbody | 1 | 11 |
| 3d-morph | 1 | 3 |
| access-fannkuch | 1 | 2 |
| access-nsieve | 1 | 2 |
| bitops-3bit-bits-in-byte | 1 | 3 |
| bitops-bits-in-byte | 1 | 3 |
| math-spectral-norm | 1 | 4 |
| bitops-bitwise-and | 0 | 0 |
| math-cordic | 1 | 2 |

### Instructions

Instructions are implemented by class inheritance.

**Inst** is the base class with the main information about an instruction:
 * Opcode (name) of the instruction
 * PC (address) of the instruction in the bytecode/file
 * Type of the instruction (bool, uint8, uint32, float, double, etc.)
 * Pointers to the next and previous Inst in the BasicBlock
 * Array of inputs (instructions whose results this Inst uses); class Inst has a virtual method that returns an empty array, derived classes override this method and return a non-empty array
 * List of users (instructions which use the result of this Inst)
 * Properties

Class **Inst** allows adding and removing users and inputs.

Class **FixedInputsInst** inherits from **Inst** for instructions with a fixed number of inputs (operands).
Class **DynamicInputsInst** inherits from **Inst** for instructions with a variable number of inputs (operands).
Class **CompareInst** inherits from **Inst** for instructions with a predicate. It contains information about the condition code (EQ, NE, LT, LE, etc.).
Class **ConstantInst** inherits from **Inst** for constant instructions. It contains a constant and the type of the constant. Constants are contained only in the start block.
Class **ParameterInst** inherits from **Inst** for input parameters. It contains the type of the parameter and the parameter number. Parameters are contained only in the start block.
Class **UnaryOperation** inherits from **FixedInputsInst** for instructions with a single input. The class is used for instructions such as NOT, NEG, ABS, etc.
Class **BinaryOperation** inherits from **FixedInputsInst** for instructions with two inputs. The class is used for instructions such as ADD, SUB, MUL, etc.

Class **CallInst** inherits from **DynamicInputsInst** for call instructions.
Class **PhiInst** inherits from **DynamicInputsInst** for phi instructions.
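
Below is a simplified, self-contained sketch of how such a hierarchy can be organized. It is only an illustration of the idea described above, not the real Panda classes:

```
#include <array>
#include <cstddef>
#include <vector>

// Simplified illustrative sketch, not the real Panda classes.
enum class Opcode { Add, Sub, Not, Phi, Constant };
enum class DataType { BOOL, INT32, INT64, FLOAT64 };

class Inst {
public:
    Inst(Opcode opcode, DataType type) : opcode_(opcode), type_(type) {}
    virtual ~Inst() = default;

    Opcode GetOpcode() const { return opcode_; }
    DataType GetType() const { return type_; }

    // The base class reports no inputs; derived classes override these methods.
    virtual size_t GetInputsCount() const { return 0; }
    virtual Inst* GetInput(size_t /*index*/) const { return nullptr; }

private:
    Opcode opcode_;
    DataType type_;
};

// Instructions with a fixed number of inputs keep them inline.
template <size_t N>
class FixedInputsInst : public Inst {
public:
    using Inst::Inst;
    size_t GetInputsCount() const override { return N; }
    Inst* GetInput(size_t index) const override { return inputs_[index]; }
    void SetInput(size_t index, Inst* input) { inputs_[index] = input; }

private:
    std::array<Inst*, N> inputs_ {};
};

using UnaryOperation = FixedInputsInst<1>;   // NOT, NEG, ABS, ...
using BinaryOperation = FixedInputsInst<2>;  // ADD, SUB, MUL, ...

// Instructions with a variable number of inputs (Phi, Call, ...) use a growable storage.
class DynamicInputsInst : public Inst {
public:
    using Inst::Inst;
    size_t GetInputsCount() const override { return inputs_.size(); }
    Inst* GetInput(size_t index) const override { return inputs_[index]; }
    void AppendInput(Inst* input) { inputs_.push_back(input); }

private:
    std::vector<Inst*> inputs_;
};
```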

#### Mixin

**Mixins** are classes with properties or data which are used by different instruction classes. For example:

**ImmediateMixin** is inherited by instruction classes with an immediate (BinaryImmOperation, ReturnInstI and so on)
**ConditionMixin** is inherited by instruction classes with a condition code (CompareInst, SelectInst, IfInst and so on)
**TypeIdMixin** is inherited by instruction classes which use a TypeId (LoadObjectInst, StoreObjectInst, NewObjectInst and so on)
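
A minimal sketch of the mixin idea (illustrative only, with a stub `Inst` base class; the real mixins and instruction classes carry more state and checks):

```
#include <cstdint>

// Minimal illustrative sketch, not the real Panda classes.
class Inst { /* base instruction class, see the sketch above */ };

class ImmediateMixin {
public:
    void SetImm(uint64_t imm) { imm_ = imm; }
    uint64_t GetImm() const { return imm_; }

private:
    uint64_t imm_ = 0;
};

enum class ConditionCode { EQ, NE, LT, LE, GT, GE };

class ConditionMixin {
public:
    void SetCc(ConditionCode cc) { cc_ = cc; }
    ConditionCode GetCc() const { return cc_; }

private:
    ConditionCode cc_ = ConditionCode::EQ;
};

// An instruction class inherits the mixins it needs in addition to its base:
class ReturnInstI : public Inst, public ImmediateMixin {
    // a return instruction that carries its value as an immediate
};

class IfInst : public Inst, public ConditionMixin {
    // a conditional branch that carries its condition code
};
```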

#### Constant instruction

Constant instructions (**ConstantInst**) can have the types FLOAT32, FLOAT64 and INT64. Constants of all integer types and references are saved as INT64. All integer instructions can have a constant input with INT64 type.
All constant instructions are contained in the **start BasicBlock**. There are no two equal constants (equal value and type) in the Graph. The Graph function *FindOrCreateConstant* is used for adding a constant to the Graph.
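
For example (a usage sketch in the style of the examples at the end of this document, assuming an integer literal is stored as an INT64 constant), requesting the same constant twice yields the same instruction:

```
// The Graph deduplicates constants: equal value and type means the same ConstantInst.
ConstantInst* c1 = graph->FindOrCreateConstant(42);
ConstantInst* c2 = graph->FindOrCreateConstant(42);
ASSERT(c1 == c2);
```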

#### Parameter instruction

A parameter instruction (**ParameterInst**) contains the type of the parameter and the parameter number. Parameters are contained only in the **start BasicBlock**. The Graph function *AddNewParameter* is used for adding a parameter to the Graph.

#### instruction.yaml

**instruction.yaml** contains the following information for each instruction:
* Opcode
* Class
* Signature (supported types of inputs and the destination type for the instruction)
* Flags
* Description

**instruction.yaml** is used for generating instructions and describing them.

!NOTE **instruction.yaml** isn't used for generating checks for instructions yet. We plan to support this.

### Exceptions

Details: [try_catch_blocks_ir.md](../compiler/docs/try_catch_blocks_ir.md)

## Reverse Post Order (RPO) tree

**RPO** builds the list of blocks for reverse post-order traversal. In an RPO iteration, a BasicBlock is visited before any of its successor BasicBlocks has been visited, except when the successor is reached by a back edge. **RPO** is implemented as a separate class, which returns the vector of pointers to BasicBlocks for the Graph. There is an option to invalidate the vector. In this case, the vector will be built from scratch after the next request for it (if the invalidation option isn't set, the current RPO vector is returned). The RPO is invalidated after control flow transformations: removing or adding blocks or edges between blocks. Also, it provides methods for updating the existing vector.

Class **RPO** allows constructing the RPO vector and adding or removing blocks in the vector.
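
For reference, the reverse post-order itself can be computed by a simple depth-first search; the standalone sketch below illustrates the idea (it is not the Panda **RPO** class):

```
#include <algorithm>
#include <functional>
#include <vector>

// Standalone sketch: blocks are numbered 0..n-1, succ[b] lists the successors
// of block b, and block 0 is the start block.
std::vector<int> BuildRpo(const std::vector<std::vector<int>>& succ)
{
    std::vector<int> order;
    std::vector<bool> visited(succ.size(), false);

    // Depth-first search that appends a block after all of its successors (post-order).
    std::function<void(int)> dfs = [&](int block) {
        visited[block] = true;
        for (int s : succ[block]) {
            if (!visited[s]) {
                dfs(s);
            }
        }
        order.push_back(block);
    };
    dfs(0);

    // Reversing the post-order gives the reverse post-order: every block comes
    // before its successors, except for successors reached by back edges.
    std::reverse(order.begin(), order.end());
    return order;
}
```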

## DomTree building

A BasicBlock "A" dominates a BasicBlock "B" if every path from the "start" block to "B" must go through "A".
**DomTree** is implemented as a separate class, but it only constructs the tree. The dominator tree itself is stored in class **BasicBlock**: each BasicBlock has a pointer to its dominator block and a vector of pointers to the blocks it dominates. **BasicBlock** has functions for changing the dominator block and the vector (adding, removing). As in the case of **RPO**, class **DomTree** has an option to invalidate the tree, but unlike **RPO**, the tree is not rebuilt automatically: the developer has to track this and call the construction function if necessary.
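
With the tree stored this way, a dominance query is just a walk up the chain of dominator pointers; a standalone sketch of the idea (not the Panda API):

```
// Standalone sketch (not the Panda API): each block stores a pointer to its
// immediate dominator; the start block has idom == nullptr.
struct Block {
    Block* idom = nullptr;
};

// "dominator" dominates "block" if it appears on the chain of immediate
// dominators from "block" up to the start block (a block dominates itself).
bool Dominates(const Block* dominator, const Block* block)
{
    for (const Block* b = block; b != nullptr; b = b->idom) {
        if (b == dominator) {
            return true;
        }
    }
    return false;
}
```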

## Instruction Iterators

The block instructions form a doubly linked list. The Phi instructions come first, followed by all the rest.
**Iteration** over the instructions can be done in forward or reverse order. `IterationType` defines which instructions are iterated: phi instructions, non-phi instructions or all instructions. A `SafeIterator` keeps the next instruction so that the current instruction can be removed during iteration.
List of the **iterators**: *PhiInstIter*, *InstIter*, *AllInstIter*, *InstReverseIter*, *PhiInstSafeIter*, *InstSafeIter*, *AllInstSafeIter*, *PhiInstSafeReverseIter*, *InstSafeReverseIter*, *AllInstSafeReverseIter*

## Data Flow Graph

The data flow graph is used by almost all optimizations, therefore it greatly affects the overhead of the JIT. The most basic and frequent use is iterating over inputs or users. One approach to make iteration more efficient is to store the data in a sequence container, such as an array or a vector, so that the elements are much more likely to be in the processor cache.

A **User** of an instruction is an object that points to the consumer instruction and its corresponding input.

An **Input** is an object that describes which instruction defines the value of the corresponding operand of the owning instruction.

Instructions can have a varying number of users, and this number doesn't depend on the instruction type. Therefore storing users in a sequence container has one big drawback: frequent storage reallocation, which leads to memory fragmentation (the IR uses an arena allocator) and additional overhead.

On the other hand, inputs depend on the instruction type and mostly have a fixed count. Thus, they should be stored in a sequence container.

The following scheme shows how the Panda JIT organizes inputs and users in memory:

![def-use structure](images/def-use-structure.png)

There are two types of def-use storage: in memory right before the instruction object and in a separate memory chunk.
- The first case is used for instructions with a fixed number of inputs. The storage is allocated right before the instruction object and is never reallocated. Most instructions belong to this category.
- The second category is instructions with a dynamic number of inputs, such as Phi instructions. Their def-use storage is allocated separately from the instruction object; the storage and the instruction are coupled by pointers to each other. When a new input is appended and the capacity of the storage is equal to its size, the whole storage is reallocated. This behavior is similar to classical vector implementations and brings additional amortized complexity to this category.

Both the user and the input have a properties field that holds the following information:
- index of the input/user
- overall number of inputs
- flag that shows whether the storage is dynamic
- additional info about the input:
  1. if the instruction is a SaveState: the virtual register number
  2. if the instruction is a Phi: the number of the basic block predecessor edge

With this field it is possible to get the instruction that owns a given user or input.

This kind of storage has been chosen because it avoids the use of virtual methods and dynamic containers for all instructions. Instead, each access to the def-use structures just checks whether the storage is dynamic and then processes the corresponding type of def-use storage. Instead of an indirect call we have one conditional branch with good branch prediction, because most instructions have a fixed number of inputs.
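
As a rough illustration of this decision, here is a standalone sketch (simplified; it is not the real Panda classes, and unlike the real layout it inlines the fixed storage into the struct instead of placing it right before the instruction object):

```
#include <cstddef>
#include <vector>

// Standalone sketch of the two def-use storage layouts: fixed inputs live in an
// inline array that is never reallocated, dynamic inputs live in a separately
// allocated, vector-like storage.
struct Inst;

struct Input {
    Inst* definition = nullptr;  // instruction that defines the value of this operand
};

struct Inst {
    static constexpr size_t MAX_FIXED_INPUTS = 2;

    bool has_dynamic_storage = false;           // the "storage is dynamic" flag from the properties field
    size_t fixed_count = 0;
    Input fixed_inputs[MAX_FIXED_INPUTS] = {};  // inline storage, allocated together with the instruction
    std::vector<Input>* dynamic_inputs = nullptr;  // separate chunk, may be reallocated on growth

    // One well-predicted conditional branch instead of a virtual call:
    Input& GetInput(size_t index)
    {
        return has_dynamic_storage ? (*dynamic_inputs)[index] : fixed_inputs[index];
    }
    size_t GetInputsCount() const
    {
        return has_dynamic_storage ? dynamic_inputs->size() : fixed_count;
    }
};
```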

## Visitor

Class **GraphVisitor** allows walking the blocks of the graph in RPO order and then all the instructions of each block. Visit functions are called according to the opcode of the instruction or to its group affiliation.

## Pass manager

!TODO Sherstennikov Mikhail add description

## Lowering

The **Lowering pass** produces low-level instructions (which are closer to machine code).
Some instructions may not appear before this pass, but at the moment we do not have any checks for this.
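
As an illustration of the idea (the opcode names below are illustrative and may not match the exact Panda instruction set): an `Add` whose second input is a constant can be replaced by an immediate form that maps directly onto an ARM64 `add` with an immediate operand.

```
// Before lowering (high level):        After lowering (low level):
//   v3 = Add v1, v2                      v3 = Add v1, v2
//   v5 = Add v3, Constant(4)             v5 = AddI v3, 4   // the constant becomes an immediate operand
```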

## Register allocation

Register allocation is the process of assigning CPU registers to instructions.
There are two base algorithms: graph-coloring allocation (by Gregory John Chaitin) and Linear Scan (by Massimiliano Poletto).
We use the "Linear Scan" algorithm because it has less overhead (the graph-coloring algorithm has a quadratic cost).

In the future, we plan to implement the graph-coloring algorithm, because it gives better code, and to select the type of allocator depending on the context.
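
For reference, the sketch below shows the classic linear scan algorithm in a standalone form (illustrative only; it is not the Panda register allocator and omits live-interval construction, register constraints and spill-slot assignment):

```
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Standalone sketch of the classic linear scan algorithm (not the Panda allocator).
struct Interval {
    int start = 0;       // first point where the value is live
    int end = 0;         // last point where the value is live
    int reg = -1;        // assigned register, -1 means "no register"
    bool spilled = false;
};

void LinearScan(std::vector<Interval>& intervals, int num_regs)
{
    // Walk the intervals in order of increasing start point.
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<int> free_regs;
    for (int i = 0; i < num_regs; ++i) {
        free_regs.push_back(i);
    }
    // Currently live ("active") intervals, ordered by increasing end point.
    auto by_end = [](const Interval* a, const Interval* b) { return a->end < b->end; };
    std::multiset<Interval*, decltype(by_end)> active(by_end);

    for (auto& cur : intervals) {
        // Expire intervals that ended before the current start and free their registers.
        while (!active.empty() && (*active.begin())->end < cur.start) {
            free_regs.push_back((*active.begin())->reg);
            active.erase(active.begin());
        }
        if (!free_regs.empty()) {
            cur.reg = free_regs.back();
            free_regs.pop_back();
            active.insert(&cur);
        } else if (!active.empty() && (*std::prev(active.end()))->end > cur.end) {
            // Spill the active interval with the furthest end point and steal its register.
            Interval* furthest = *std::prev(active.end());
            cur.reg = furthest->reg;
            furthest->reg = -1;
            furthest->spilled = true;
            active.erase(std::prev(active.end()));
            active.insert(&cur);
        } else {
            // No register available and the current interval ends last: spill it.
            cur.spilled = true;
        }
    }
}
```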

## Code generator

Code generation is a complex process that converts IR code into machine code.
At the moment, we consider Arm64 as the main architecture.
We chose the standard vixl library for code generation to make the implementation faster and avoid possible errors.
The vixl library was created by ARM developers to make it easy to implement assembly generation and emulation.
It is used in HHVM, IonMonkey and DartVM and has proved its reliability.

In the future, we plan to make a fully in-house implementation for more optimal code generation (in terms of overhead and performance).

!TODO Gorban Igor update description

## Example of use

### Create Graph

```
Graph* graph = new (allocator) Graph(&allocator_, panda_file_, /*method_idx*/ -1, /*is_arm64*/ true);
```

### Create blocks and CFG

```
BasicBlock* start = graph->CreateStartBlock();
BasicBlock* end = graph->CreateEndBlock();
BasicBlock* block = graph->CreateEmptyBlock();

start->AddSucc(block);
block->AddSucc(end);
block->AddSucc(block);
```

### Create instructions and add them to a block

```
ConstantInst* constant = graph->FindOrCreateConstant(value);
ParameterInst* param = graph->AddNewParameter(slot_num, type);
Inst* phi1 = graph->CreateInst(Opcode::Phi);
Inst* phi2 = graph->CreateInst(Opcode::Phi);
Inst* compare = graph->CreateInst(Opcode::Compare);
Inst* add = graph->CreateInst(Opcode::Add);
block->AppendPhi(phi1);
block->AppendInst(compare);
block->InsertAfter(phi2, phi1);
block->InsertBefore(add, compare);

for (auto inst : block->PhiInsts()) {
    ASSERT(inst->GetOpcode() == Opcode::Phi);
    ......
}
for (auto inst : block->Insts()) {
    ASSERT(inst->GetOpcode() != Opcode::Phi);
    ......
}
for (auto inst : block->AllInsts()) {
    ......
}
for (auto inst : block->InstsSafe()) {
    if (inst->GetOpcode() == Opcode::Add) {
        block->EraseInst(inst);
    }
}
```

### Visitors

```
 struct ExampleVisitor: public GraphVisitor {
     using GraphVisitor::GraphVisitor;

     // Specify blocks to visit and their order
     const ArenaVector<BasicBlock *> &GetBlocksToVisit() const override
     {
         return GetGraph()->GetBlocksRPO();
     }
     // Print special message for Mul instruction
     static void VisitMul(GraphVisitor* v, Inst* inst) {
         std::cerr << "Multiply instruction\n";
     }
     // For all other instructions print its opcode
     void VisitDefault(Inst* inst) override {
         std::cerr << OPCODE_NAMES[(int)inst->GetOpcode()] << std::endl;
     }
     // Visitor for all instructions which are the instance of the BinaryOperation
     void VisitInst(BinaryOperation* inst) override {
         std::cerr << "Visit binary operation\n";
     }
     #include "visitor.inc"
};
....
    ExampleVisitor visitor(graph);
    visitor.VisitGraph();
    visitor.VisitGraphGrouped();
```