• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# Rationale for Bytecode
2
3## Introduction
4
5This document sets up some context about bytecode design principles and provides rationales for
6bytecode design in Panda Runtime.
7
8## Bytecode basics
9
10Before discussing bytecode per se, let's take a look at an over-simplified picture of a real
11hardware running a program.
12
13There is a central processing unit (CPU) that reads commands (or _instructions_) from
14somewhere in memory and executes corresponding _operations_ on operation's arguments,
15also known as operation's _operands_. Operands may be _registers_ (very fast "variables" located
16directly on the CPU) or _memory_ (some locations in computer's RAM). An important subset of memory
17operands are _stack operands_ that reside in a special data structure called _stack_. The program
18must maintain the stack in the correct state during runtime because exactly this data structure
19is used for storing local variables along with function arguments and doing function calls.
20
21In real world, different CPU manufacturers provide different sets of commands for their devices –
22or, in other words, different CPUs have different _instruction set architectures_. This means
23that the number and purpose of registers differs, too. Some nuances of working with stack may also
24vary across CPUs and/or different operating systems.
25
26Here comes the bytecode. Simply said, it is an attempt to build an abstract CPU on top of real
27ones. A program written for such abstract CPU can be run on any real hardware with the help of a
28special program called _interpreter_. The goal of the interpreter is to read our unified _virtual_
29commands (or bytecode) and execute them. Of course, this implies additional performance overhead
30making interpretation slower than _native code execution_. In return, we get the ability to
31abstract from CPU limitations and run our program wherever our interpreter runs. Tooling
32(debugger, profilers, etc.) is also unified, as well as the ecosystem for managing libraries,
33frameworks, etc.
34
35Although bytecode represents some abstraction, it mirrors all the mentioned concepts from the
36hardware world: the terms "operations", "operands", "registers" and "stack" have the same meaning.
37In case there is a chance for ambiguity, the terms "virtual registers" and "virtual stack" are used
38to distinguish between an abstract system and the hardware.
39
40Just as real CPUs can expose different instruction set architectures, there is no universal way of
41building bytecode. Following sections explain advantages and disadvantages of various approaches.
42
43## Encoding operands
44
45One very important question is how an operation refers to its operands.
46
47In _stack-based_ approach, operands are implicitly encoded in the operation, which results in
48following code:
49
50```
51.function foo(arg1, arg2)
52    push_arg1 ; copy the first argument to the top of the stack
53    push_arg2 ; copy the second argument to the top of stack
54    add       ; remove two top-most values from the stack, add them and put the result at the top
55    ; at this point, the top of the stack contains arg1 + arg2
56    ...
57.end
58```
59
60In _register-based approach_, operands are explicitly encoded in the operation, which results in
61following code:
62
63```
64.function foo(arg1, arg2)
65    add vreg0, arg1, arg2 ; vreg0 = arg1 + arg21
66    ; at this point, virtual register 0 contains arg1 + arg2
67    ...
68.end
69```
70
71This example demonstrates a fundamental difference between two approaches. Stack-based approach
72operates with smaller instructions. Indeed, each instruction `push_arg1`, `push_arg1`, and `add`
73can be represented with a single byte, while register-based `add reg_dst, reg_src1, reg_src2` may
74require up to 4 bytes to encode.
75
76At the same time, to execute a stack-based addition we need to run 3 instructions compared to
77just a single register-based instruction. Since the interpreter has an extra work to do to read
78each bytecode instruction, execute it and move to the next one, running more instruction results in
79more _dispatch overhead_. Which means that the stack-based bytecode is slower by nature.
80
81According to our experiment, uncompressed register-based bytecode can be reduced by ~26%
82if substituted by a stack-based analogue. At the same time, performance becomes 10%-40% worse
83(depending on the benchmark).
84
85Since bytecode interpretation is a required program execution mode for Panda, performance of the
86interpreter is very important, that's why
87**Panda uses register-based instruction set architecture**.
88
89However, to address the issue of compactness, two main tweaks are used:
90
91* Implicitly addressed accumulator register.
92* Variable size of instructions with frequent instructions are encoded to be smaller.
93
94According to our research, these tweaks will allow to reduce the size of uncompressed bytecode by
95~20% compared to pure register-based bytecode.
96
97### Implicitly addressed accumulator register
98
99Panda bytecode has a dedicated register called _accumulator_, which is addressed implicitly
100by some bytecodes. With this tweak, our example can be rewritten as follows:
101
102```
103.function foo(arg1, arg2)
104    adda arg1, arg2 ; acc = arg1 + arg21
105    ; at this point, accumulator register contains arg1 + arg2
106    ...
107.end
108```
109
110With this approach, we are no longer required to encode destination register, it is "hardcoded" to
111be an accumulator register. Having an implicitly addressed accumulator register de facto borrows
112some "stack-based'ness" into an otherwise register-based instruction set in attempt to make the
113encoding more compact.
114
115In an ideal case, accumulator register may safe us ~25% of size. But it needs to be used carefully:
116
117* Sometimes you might want to write directly into virtual register. e.g. for register moves (that
118  are popular) and for increment/decrement instructions (when loop variable is only read in a loop
119  body forming a separate def-use chain, i.e. in the majority of loops.
120* You don't need to pass object reference in accumulator in the object call. Usually objects live
121  longer than accumulator value (otherwise calls will be accompanied with moves from and to
122  accumulator, reducing performance and increasing encoding size).
123* The same goes with object and array loads and stores.
124
125To address the risk of producing inefficient bytecode with redundant moves from and to
126accumulator, a simple optimizer will be introduced as a part of the toolchain.
127
128Finally, using accumulator allows getting rid of the instructions for writing the result to the register,
129which also saves us encoding space and improves performance
130
131### Variable size of instructions
132
133Let's take a closer look at `adda arg1, arg2`. Assume that arguments map to virtual registers on
134the virtual stack as follows:
135
136```
137+--------------+----------------------+
138| accumulator  |                      |
139| virt. reg. 0 | some local variable  |
140| virt. reg. 1 | some local variable  |
141| virt. reg. 2 | some temporary value |
142| virt. reg. 3 | some temporary value |
143| virt. reg. 4 | arg1                 |
144| virt. reg. 5 | arg2                 |
145+--------------+----------------------+
146```
147
148It easy to see that to address virtual registers 4 and 5 we need just 3 bits which allows to encode
149the instruction as follows:
150
151```
152|<-       8 bits       ->|<- 4 bits ->|<- 4 bits ->|
153|     operation code     |   vreg 1   |   vreg 2   |
154```
155
156This trick gives us just `1 + 0.5 + 0.5 = 2` bytes for a single instructions, which get us closer
157to the stack-based approach. Of course, if virtual registers have large numbers that do no fit
158into 4 bits, we have to use a wider encoding:
159
160```
161|<-       8 bits       ->|<-       8 bits       ->|<-       8 bits       ->|
162|     operation code     |         vreg 1         |         vreg 2         |
163```
164
165How to make sure that we benefit from the shorter encoding most of the time? An observation shows
166that most of operations inside a function happen on local and/or temporary variables, while
167function arguments participate as operands in a fewer number of cases. With that in mind, let's map
168function arguments to virtual registers with larger numbers reserving smaller ones for local
169and/or variables.
170
171Please note also that we don't need "full-range" versions for all instructions. In case some
172instruction lacks a wide-range form, we can prepare operands for it with moves that have all
173needed forms. Thus we save on opcode space without losing in encoding size (on average).
174
175With such approach, we can carefully introduce various "overloads" for instruction when it could
176be beneficial. For example, we have three types of instructions for integer-sized arithmetic
177(acc-reg-reg, acc-reg, acc-imm) and integer-based jumps, but not for floating-point arithmetic
178(which is rare) and which is supposed to have only acc-reg form. Another good candidates for
179overloads are calls (different number of operands) and calls are the most popular instructions in
180applications (thus we again save encoding space).
181
182## Handling various data types
183
184Another important question is how bytecode is supposed to handle various data types. Back to our
185`adda ...` instruction, what are types of its operands?
186
187One option is to make the operation _statically typed_, i.e. specify explicitly that it works only
188with say 64-bit integers. In this case, if we want to add two double-precision floating point
189numbers and store the result into accumulator, we will need a dedicated `adda_d ...`, etc.
190
191Another option is to make the operation _dynamically typed_, i.e. specify that `adda ...` handles
192all kinds of addition (for short and long integers, for signed and unsigned integers, for
193single- and double-precision numbers, etc.).
194
195The first approach bloats the instruction set, but keeps the semantics of each instruction simple
196and compact. The second approach keeps the instruction set small, but bloats the semantics of
197each instruction.
198
199It may seem that the dynamically typed approach is better for dynamically typed languages, but it
200is true only if the platform is **not** supposed to support multiple languages.
201Consider a simple example: what is the result of the expression `4 + "2"` in JavaScript and, say,
202Python? In JavaScript, it evaluates to the string `"42"`, while Python forbids adding a string to
203a number without an explicit type cast. This means that if we would like to run these two languages
204on the same platform with the same bytecode, we would have to handle both JavaScript-style addition
205and Python-style addition within a single instruction, which would eventually lead us to an
206unmaintainable bytecode.
207
208Thus, as we are required to support multiple languages (both statically and dynamically typed),
209**Panda uses statically-typed bytecode**.
210
211There may be a concern: Does a statically typed bytecode forbid us to support a dynamically typed
212language? No, it does not. In practice, it is always possible to compile a dynamically typed
213language to some statically typed instruction set: after all, all native hardware instructions
214sets are "statically-typed" in our terminology.
215
216There may be another concern: Does a statically-typed bytecode imply statically-typed registers?
217I.e. does it mean that if `adda reg1, reg2` operates only on 64-bit integers, registers `reg1`
218and `reg2` **must** hold only integer values throughout the function? Fortunately, the answer is
219no, they must not, virtual registers may hold value of different types (just as hardware registers,
220which do not distinguish between integers and pointers on many platforms). The key constraint is
221that once a value of a certain type is store into a virtual register, all operations on that value
222must be of this very type, unless the virtual register is redefined. Language compilers and
223bytecode verifiers take the responsibility to control this invariant.
224