• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# High-Level Design
2![High-Level Design](img/engines_high_level_design.png)
3
4The diagram above shows the interactions between the major components of JerryScript: Parser and Virtual Machine (VM). Parser performs translation of input ECMAScript application into the byte-code with the specified format (refer to [Bytecode](#byte-code) and [Parser](#parser) page for details). Prepared bytecode is executed by the Virtual Machine that performs interpretation (refer to [Virtual Machine](#virtual-machine) and [ECMA](#ecma) pages for details).
5
6# Parser
7
8The parser is implemented as a recursive descent parser. The parser converts the JavaScript source code directly into byte-code without building an Abstract Syntax Tree. The parser depends on the following subcomponents.
9
10## Lexer
11
12The lexer splits input string (ECMAScript program) into sequence of tokens. It is able to scan the input string not only forward, but it is possible to move to an arbitrary position. The token structure described by structure `lexer_token_t` in `./jerry-core/parser/js/js-lexer.h`.
13
14## Scanner
15
16Scanner (`./jerry-core/parser/js/js-parser-scanner.c`) pre-scans the input string to find certain tokens. For example, scanner determines whether the keyword `for` defines a general for or a for-in loop. Reading tokens in a while loop is not enough because a slash (`/`) can indicate the start of a regular expression or can be a division operator.
17
18## Expression Parser
19
20Expression parser is responsible for parsing JavaScript expressions. It is implemented in `./jerry-core/parser/js/js-parser-expr.c`.
21
22## Statement Parser
23
24JavaScript statements are parsed by this component. It uses the [Expression parser](#expression-parser) to parse the constituent expressions. The implementation of Statement parser is located in `./jerry-core/parser/js/js-parser-statm.c`.
25
26Function `parser_parse_source` carries out the parsing and compiling of the input ECMAScript source code. When a function appears in the source `parser_parse_source` calls `parser_parse_function` which is responsible for processing the source code of functions recursively including argument parsing and context handling. After the parsing, function `parser_post_processing` dumps the created opcodes and returns an `ecma_compiled_code_t*` that points to the compiled bytecode sequence.
27
28The interactions between the major components shown on the following figure.
29
30![Parser dependency](img/parser_dependency.png)
31
32# Byte-code
33
34This section describes the compact byte-code (CBC) representation. The key focus is reducing memory consumption of the byte-code representation without sacrificing considerable performance. Other byte-code representations often focus on performance only so inventing this representation is an original research.
35
36CBC is a CISC like instruction set which assigns shorter instructions for frequent operations. Many instructions represent multiple atomic tasks which reduces the bytecode size. This technique is basically a data compression method.
37
38## Compiled Code Format
39
40The memory layout of the compiled bytecode is the following.
41
42![CBC layout](img/CBC_layout.png)
43
44The header is a `cbc_compiled_code` structure with several fields. These fields contain the key properties of the compiled code.
45
46The literals part is an array of ecma values. These values can contain any ECMAScript value types, e.g. strings, numbers, functions and regexp templates. The number of literals is stored in the `literal_end` field of the header.
47
48CBC instruction list is a sequence of bytecode instructions which represents the compiled code.
49
50## Byte-code Format
51
52The memory layout of a byte-code is the following:
53
54![byte-code layout](img/opcode_layout.png)
55
56Each byte-code starts with an opcode. The opcode is one byte long for frequent and two byte long for rare instructions. The first byte of the rare instructions is always zero (`CBC_EXT_OPCODE`), and the second byte represents the extended opcode. The name of common and rare instructions start with `CBC_` and `CBC_EXT_` prefix respectively.
57
58The maximum number of opcodes is 511, since 255 common (zero value excluded) and 256 rare instructions can be defined. Currently around 215 frequent and 70 rare instructions are available.
59
60There are three types of bytecode arguments in CBC:
61
62 * __byte argument__: A value between 0 and 255, which often represents the argument count of call like opcodes (function call, new, eval, etc.).
63
64 * __literal argument__: An integer index which is greater or equal than zero and less than the `literal_end` field of the header. For further information see next section Literals (next).
65
66 * __relative branch__: An 1-3 byte long offset. The branch argument might also represent the end of an instruction range. For example the branch argument of `CBC_EXT_WITH_CREATE_CONTEXT` shows the end of a `with` statement. More precisely the position after the last instruction in the with clause.
67
68Argument combinations are limited to the following seven forms:
69
70* no arguments
71* a literal argument
72* a byte argument
73* a branch argument
74* a byte and a literal arguments
75* two literal arguments
76* three literal arguments
77
78## Literals
79
80Literals are organized into groups whose represent various literal types. Having these groups consuming less space than assigning flag bits to each literal.
81(In the followings, the mentioned ranges represent those indicies which are greater than or equal to the left side and less than the right side of the range. For example a range between `ident_end` and `literal_end` fields of the byte-code header contains those indicies, which are greater than or equal to `ident_end`
82and less than `literal_end`. If `ident_end` equals to `literal_end` the range is empty.)
83
84The two major group of literals are _identifiers_ and _values_.
85
86  * __identifier__: A named reference to a variable. Literals between zero and `ident_end` of the header belongs to here. All of these literals must be a string or undefined. Undefined can only be used for those literals which cannot be accessed by a literal name. For example `function (arg,arg)` has two arguments, but the `arg` identifier only refers to the second argument. In such cases the name of the first argument is undefined. Furthermore optimizations such as *CSE* may also introduce literals without name.
87
88  * __value__: A reference to an immediate value. Literals between `ident_end` and `const_literal_end` are constant values such as numbers or strings. These literals can be used directly by the Virtual Machine. Literals between `const_literal_end` and `literal_end` are template literals. A new object needs to be constructed each time when their value is accessed. These literals are functions and regular expressions.
89
90There are two other sub-groups of identifiers. *Registers* are those identifiers which are stored in the function call stack. *Arguments* are those registers which are passed by a caller function.
91
92There are two types of literal encoding in CBC. Both are variable length, where the length is one or two byte long.
93
94  * __small__: maximum 511 literals can be encoded.
95
96One byte encoding for literals 0 - 254.
97
98```c
99byte[0] = literal_index
100```
101
102Two byte encoding for literals 255 - 510.
103
104```c
105byte[0] = 0xff
106byte[1] = literal_index - 0xff
107```
108
109  * __full__: maximum 32767 literal can be encoded.
110
111One byte encoding for literals 0 - 127.
112
113```c
114byte[0] = literal_index
115```
116
117Two byte encoding for literals 128 - 32767.
118
119```c
120byte[0] = (literal_index >> 8) | 0x80
121byte[1] = (literal_index & 0xff)
122```
123
124Since most functions require less than 255 literal, small encoding provides a single byte literal index for all literals. Small encoding consumes less space than full encoding, but it has a limited range.
125
126## Literal Store
127
128JerryScript does not have a global string table for literals, but stores them into the Literal Store. During the parsing phase, when a new literal appears with the same identifier that has already occurred before, the string won't be stored once again, but the identifier in the Literal Store will be used. If a new literal is not in the Literal Store yet, it will be inserted.
129
130## Byte-code Categories
131
132Byte-codes can be placed into four main categories.
133
134### Push Byte-codes
135
136Byte-codes of this category serve for placing objects onto the stack. As there are many instructions representing multiple atomic tasks in CBC, there are also many instructions for pushing objects onto the stack according to the number and the type of the arguments. The following table list a few of these opcodes with a brief description.
137
138<span class="CSSTableGenerator" markdown="block">
139
140| byte-code             | description                                           |
141| --------------------- | ----------------------------------------------------- |
142| CBC_PUSH_LITERAL      | Pushes the value of the given literal argument.       |
143| CBC_PUSH_TWO_LITERALS | Pushes the values of the given two literal arguments. |
144| CBC_PUSH_UNDEFINED    | Pushes an undefined value.                            |
145| CBC_PUSH_TRUE         | Pushes a logical true.                                |
146| CBC_PUSH_PROP_LITERAL | Pushes a property whose base object is popped from the stack, and the property name is passed as a literal argument. |
147
148</span>
149
150### Call Byte-codes
151
152The byte-codes of this category perform calls in different ways.
153
154<span class="CSSTableGenerator" markdown="block">
155
156| byte-code             | description                                                                          |
157| --------------------- | ------------------------------------------------------------------------------------ |
158| CBC_CALL0             | Calls a function without arguments. The return value won't be pushed onto the stack. |
159| CBC_CALL1             | Calls a function with one argument. The return value won't be pushed onto the stack. |
160| CBC_CALL              | Calls a function with n arguments. n is passed as a byte argument. The return value won't be pushed onto the stack. |
161| CBC_CALL0_PUSH_RESULT | Calls a function without arguments. The return value will be pushed onto the stack.  |
162| CBC_CALL1_PUSH_RESULT | Calls a function with one argument. The return value will be pushed onto the stack.  |
163| CBC_CALL2_PROP        | Calls a property function with two arguments. The base object, the property name, and the two arguments are on the stack. |
164
165</span>
166
167### Arithmetic, Logical, Bitwise and Assignment Byte-codes
168
169The opcodes of this category perform arithmetic, logical, bitwise and assignment operations.
170
171<span class="CSSTableGenerator" markdown="block">
172
173| byte-code               | description                                                                                         |
174| ----------------------- | --------------------------------------------------------------------------------------------------- |
175| CBC_LOGICAL_NOT         | Negates the logical value that popped from the stack. The result is pushed onto the stack.          |
176| CBC_LOGICAL_NOT_LITERAL | Negates the logical value that given in literal argument. The result is pushed onto the stack.      |
177| CBC_ADD                 | Adds two values that are popped from the stack. The result is pushed onto the stack.                |
178| CBC_ADD_RIGHT_LITERAL   | Adds two values. The left one popped from the stack, the right one is given as literal argument.    |
179| CBC_ADD_TWO_LITERALS    | Adds two values. Both are given as literal arguments.                                               |
180| CBC_ASSIGN              | Assigns a value to a property. It has three arguments: base object, property name, value to assign. |
181| CBC_ASSIGN_PUSH_RESULT  | Assigns a value to a property. It has three arguments: base object, property name, value to assign. The result will be pushed onto the stack. |
182
183</span>
184
185### Branch Byte-codes
186
187Branch byte-codes are used to perform conditional and unconditional jumps in the byte-code. The arguments of these instructions are 1-3 byte long relative offsets. The number of bytes is part of the opcode, so each byte-code with a branch argument has three forms. The direction (forward, backward) is also defined by the opcode since the offset is an unsigned value. Thus, certain branch instructions has six forms. Some examples can be found in the following table.
188
189<span class="CSSTableGenerator" markdown="block">
190
191| byte-code                  | description                                                 |
192| -------------------------- | ----------------------------------------------------------- |
193| CBC_JUMP_FORWARD           | Jumps forward by the 1 byte long relative offset argument.  |
194| CBC_JUMP_FORWARD_2         | Jumps forward by the 2 byte long relative offset argument.  |
195| CBC_JUMP_FORWARD_3         | Jumps forward by the 3 byte long relative offset argument.  |
196| CBC_JUMP_BACKWARD          | Jumps backward by the 1 byte long relative offset argument. |
197| CBC_JUMP_BACKWARD_2        | Jumps backward by the 2 byte long relative offset argument. |
198| CBC_JUMP_BACKWARD_3        | Jumps backward by the 3 byte long relative offset argument. |
199| CBC_BRANCH_IF_TRUE_FORWARD | Jumps forward if the value on the top of the stack is true by the 1 byte long relative offset argument. |
200
201</span>
202
203## Snapshot
204
205The compiled byte-code can be saved into a snapshot, which also can be loaded back for execution. Directly executing the snapshot saves the costs of parsing the source in terms of memory consumption and performance. The snapshot can also be executed from ROM, in which case the overhead of loading it into the memory can also be saved.
206
207
208# Virtual Machine
209
210Virtual machine is an interpreter which executes byte-code instructions one by one. The function that starts the interpretation is `vm_run` in `./jerry-core/vm/vm.c`. `vm_loop` is the main loop of the virtual machine, which has the peculiarity that it is *non-recursive*. This means that in case of function calls it does not calls itself recursively but returns, which has the benefit that it does not burdens the stack as a recursive implementation.
211
212# ECMA
213
214ECMA component of the engine is responsible for the following notions:
215
216* Data representation
217* Runtime representation
218* Garbage collection (GC)
219
220## Data Representation
221
222The major structure for data representation is `ECMA_value`. The lower three bits of this structure encode value tag, which determines the type of the value:
223
224* simple
225* number
226* string
227* object
228* symbol
229* error
230
231![ECMA value representation](img/ecma_value.png)
232
233In case of number, string and object the value contains an encoded pointer, and
234simple value is a pre-defined constant which can be:
235
236* undefined
237* null
238* true
239* false
240* empty (uninitialized value)
241
242### Compressed Pointers
243
244Compressed pointers were introduced to save heap space.
245
246![Compressed Pointer](img/ecma_compressed.png)
247
248These pointers are 8 byte aligned 16 bit long pointers which can address 512 Kb of
249memory which is also the maximum size of the JerryScript heap. To support even more
250memory the size of compressed pointers can be extended to 32 bit to cover the entire
251address space of a 32 bit system by passing "--cpointer_32_bit on" to the build
252system. These "uncompressed pointers" increases the memory consumption by around 20%.
253
254### Number
255
256There are two possible representation of numbers according to standard IEEE 754:
257The default is 8-byte (double),
258but the engine supports the 4-byte (single precision) representation by setting JERRY_NUMBER_TYPE_FLOAT64 to 0 as well.
259
260![Number](img/number.png)
261
262Several references to single allocated number are not supported. Each reference holds its own copy of a number.
263
264### String
265
266Strings in JerryScript are not just character sequences, but can hold numbers and so-called magic ids too. For common character sequences (defined in `./jerry-core/lit/lit-magic-strings.ini`) there is a table in the read only memory that contains magic id and character sequence pairs. If a string is already in this table, the magic id of its string is stored, not the character sequence itself. Using numbers speeds up the property access. These techniques save memory.
267
268### Object / Lexical Environment
269
270An object can be a conventional data object or a lexical environment object. Unlike other data types, object can have references (called properties) to other data types. Because of circular references, reference counting is not always enough to determine dead objects. Hence a chain list is formed from all existing objects, which can be used to find unreferenced objects during garbage collection. The `gc-next` pointer of each object shows the next allocated object in the chain list.
271
272[Lexical environments](http://www.ecma-international.org/ecma-262/5.1/#sec-10.2) are implemented as objects in JerryScript, since lexical environments contains key-value pairs (called bindings) like objects. This simplifies the implementation and reduces code size.
273
274![Object/Lexicat environment structures](img/ecma_object.png)
275
276The objects are represented as following structure:
277
278  * Reference counter - number of hard (non-property) references
279  * Next object pointer for the garbage collector
280  * type (function object, lexical environment, etc.)
281
282### Properties of Objects
283
284![Object properties](img/ecma_object_property.png)
285
286Objects have a linked list that contains their properties. This list actually contains property pairs, in order to save memory described in the followings:
287A property is 7 bit long and its type field is 2 bit long which consumes 9 bit which does not fit into 1 byte but consumes 2 bytes. Hence, placing together two properties (14 bit) with the 2 bit long type field fits into 2 bytes.
288
289#### Property Hashmap
290
291If the number of property pairs reach a limit (currently this limit is defined to 16), a hash map (called [Property Hashmap](#property-hashmap)) is inserted at the first position of the property pair list, in order to find a property using it, instead of finding it by iterating linearly over the property pairs.
292
293Property hashmap contains 2<sup>n</sup> elements, where 2<sup>n</sup> is larger than the number of properties of the object. Each element can have tree types of value:
294
295* null, indicating an empty element
296* deleted, indicating a deleted property, or
297* reference to the existing property
298
299This hashmap is a must-return type cache, meaning that every property that the object have, can be found using it.
300
301#### Internal Properties
302
303Internal properties are special properties that carry meta-information that cannot be accessed by the JavaScript code, but important for the engine itself. Some examples of internal properties are listed below:
304
305* [[Class]] - class (type) of the object (ECMA-defined)
306* [[Code]] - points where to find bytecode of the function
307* native code - points where to find the code of a native function
308* [[PrimitiveValue]] for Boolean - stores the boolean value of a Boolean object
309* [[PrimitiveValue]] for Number - stores the numeric value of a Number object
310
311### LCache
312
313LCache is a hashmap for finding a property specified by an object and by a property name. The object-name-property layout of the LCache presents multiple times in a row as it is shown in the figure below.
314
315![LCache](img/ecma_lcache.png)
316
317When a property access occurs, a hash value is extracted from the demanded property name and than this hash is used to index the LCache. After that, in the indexed row the specified object and property name will be searched.
318
319It is important to note, that if the specified property is not found in the LCache, it does not mean that it does not exist (i.e. LCache is a may-return cache). If the property is not found, it will be searched in the property-list of the object, and if it is found there, the property will be placed into the LCache.
320
321### Collections
322
323Collections are array-like data structures, which are optimized to save memory. Actually, a collection is a linked list whose elements are not single elements, but arrays which can contain multiple elements.
324
325### Exception Handling
326
327In order to implement a sense of exception handling, the return values of JerryScript functions are able to indicate their faulty or "exceptional" operation. The return values are ECMA values (see section [Data Representation](#data-representation)) and if an erroneous operation occurred the ECMA_VALUE_ERROR simple value is returned.
328
329### Value Management and Ownership
330
331Every ECMA value stored by the engine is associated with a virtual "ownership", that defines how to manage the value: when to free it when it is not needed anymore and how to pass the value to an other function.
332
333Initially, value is allocated by its owner (i.e. with ownership). The owner has the responsibility for freeing the allocated value. When the value is passed to a function as an argument, the ownership of it will not pass, the called function have to make an own copy of the value. However, as long as a function returns a value, the ownership will pass, thus the caller will be responsible for freeing it.
334