<?xml version="1.0"?>
<!--
    * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements.  See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership.  The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License.  You may obtain a copy of the License at
    *
    *   http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing,
    * software distributed under the License is distributed on an
    * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    * KIND, either express or implied.  See the License for the
    * specific language governing permissions and limitations
    * under the License.
-->
<document>
  <properties>
    <title>The Java Virtual Machine</title>
  </properties>

  <body>
    <section name="The Java Virtual Machine">
      <p>
        Readers already familiar with the Java Virtual Machine and the
        Java class file format may want to skip this section and proceed
        with <a href="bcel-api.html">section 3</a>.
      </p>

      <p>
        Programs written in the Java language are compiled into a portable
        binary format called <em>byte code</em>. Every class is
        represented by a single class file containing class-related data
        and byte code instructions. These files are loaded dynamically
        into an interpreter, the <a
              href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine</a> (JVM), and executed.
      </p>

      <p>
        <a href="#Figure 1">Figure 1</a> illustrates the procedure of
        compiling and executing a Java class: the source file
        (<tt>HelloWorld.java</tt>) is compiled into a Java class file
        (<tt>HelloWorld.class</tt>), loaded by the byte code interpreter,
        and executed. In order to implement additional features,
        researchers may want to transform class files (drawn with bold
        lines) before they are actually executed. This application area
        is one of the main topics of this article.
      </p>

      <p align="center">
        <a name="Figure 1">
          <img src="../images/jvm.gif"/>
          <br/>
          Figure 1: Compilation and execution of Java classes</a>
      </p>

      <p>
        Note that the general term "Java" in fact has two meanings: on
        the one hand, Java as a programming language; on the other hand,
        the Java Virtual Machine, which is not targeted exclusively by
        the Java language, but may be used by <a
              href="http://www.robert-tolksdorf.de/vmlanguages.html">other
        languages</a> as well. We assume that the reader is familiar with
        the Java language and has a general understanding of the
        Virtual Machine.
      </p>

    <subsection name="Java class file format">
      <p>
        Giving a full overview of the design issues of the Java class file
        format and the associated byte code instructions is beyond the
        scope of this paper. We will just give a brief introduction
        covering the details that are necessary for understanding the rest
        of this paper. The format of class files and the byte code
        instruction set are described in more detail in the <a
              href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine Specification</a>. In particular, we will not deal
        with the security constraints that the Java Virtual Machine has to
        check at run-time, i.e., the byte code verifier.
      </p>

      <p>
        <a href="#Figure 2">Figure 2</a> shows a simplified example of the
        contents of a Java class file: it starts with a header containing
        a "magic number" (<tt>0xCAFEBABE</tt>) and the version number,
        followed by the <em>constant pool</em>, which can be roughly
        thought of as the text segment of an executable, the <em>access
        rights</em> of the class encoded by a bit mask, a list of
        interfaces implemented by the class, lists containing the fields
        and methods of the class, and finally the <em>class
        attributes</em>, e.g., the <tt>SourceFile</tt> attribute giving
        the name of the source file. Attributes are a way of putting
        additional, user-defined information into class file data
        structures. For example, a custom class loader may evaluate such
        attribute data in order to perform its transformations. The JVM
        specification requires that unknown, i.e., user-defined, attributes
        be ignored by any Virtual Machine implementation.
      </p>

      <p align="center">
        <a name="Figure 2">
          <img src="../images/classfile.gif"/>
          <br/>
          Figure 2: Java class file format</a>
      </p>
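
      <p>
        The beginning of this layout can be sketched in a few lines of
        Java. The fragment below is only an illustration and not part of
        <font face="helvetica,arial">BCEL</font>: the class
        <tt>ClassFileHeader</tt> and its method <tt>parse</tt> are
        hypothetical names, and the ten bytes used in <tt>main</tt> are
        made up rather than taken from a real class file. It decodes the
        magic number, the two version numbers and the constant pool
        count, which according to the JVM specification is one larger
        than the number of pool entries.
      </p>

      <source><![CDATA[
```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration: decodes the first ten bytes of a class file. */
final class ClassFileHeader {
    final int minorVersion;
    final int majorVersion;
    final int constantPoolCount;

    private ClassFileHeader(int minorVersion, int majorVersion, int constantPoolCount) {
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
        this.constantPoolCount = constantPoolCount;
    }

    static ClassFileHeader parse(byte[] classFile) {
        // Class files are big-endian, which is also the default byte order of ByteBuffer.
        ByteBuffer buffer = ByteBuffer.wrap(classFile);
        if (buffer.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file: wrong magic number");
        }
        int minor = buffer.getShort() & 0xFFFF;   // unsigned 16-bit values
        int major = buffer.getShort() & 0xFFFF;
        int count = buffer.getShort() & 0xFFFF;   // one more than the number of pool entries
        return new ClassFileHeader(minor, major, count);
    }

    public static void main(String[] args) {
        // Made-up header bytes: magic number, version 52.0, constant pool count 16.
        byte[] header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52, 0, 16
        };
        ClassFileHeader parsed = parse(header);
        System.out.println("version " + parsed.majorVersion + "." + parsed.minorVersion
                + ", constant pool count " + parsed.constantPoolCount);
    }
}
```
]]></source>

      <p>
        Everything after these ten bytes (the constant pool entries,
        access rights, interfaces, fields, methods and attributes) has a
        variable length and is deliberately left out of this sketch.
      </p>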

      <p>
        Because all of the information needed to dynamically resolve the
        symbolic references to classes, fields and methods at run-time is
        coded with string constants, the constant pool makes up the
        largest portion of an average class file, approximately 60%.
        This makes the constant pool an easy target for code
        manipulation. The byte code instructions themselves make up
        only about 12%.
      </p>

      <p>
        The upper right box shows a "zoomed" excerpt of the constant pool,
        while the rounded box below depicts some instructions that are
        contained within a method of the example class. These
        instructions represent the straightforward translation of the
        well-known statement:
      </p>

      <p align="center">
        <source>System.out.println("Hello, world");</source>
      </p>

      <p>
        The first instruction loads the contents of the field <tt>out</tt>
        of class <tt>java.lang.System</tt> onto the operand stack. This is
        an instance of the class <tt>java.io.PrintStream</tt>. The
        <tt>ldc</tt> ("load constant") instruction pushes a reference to
        the string "Hello, world" onto the stack. The next instruction
        invokes the instance method <tt>println</tt>, which takes both
        values as parameters (instance methods always implicitly take an
        instance reference as their first argument).
      </p>

      <p>
        Instructions, other data structures within the class file, and
        constants themselves may refer to constants in the constant pool.
        Such references are implemented via fixed indexes encoded directly
        into the instructions. This is illustrated for some items of the
        figure that are emphasized with a surrounding box.
      </p>

      <p>
        For example, the <tt>invokevirtual</tt> instruction refers to a
        <tt>Methodref</tt> constant that contains information about the
        name of the called method, the signature (i.e., the encoded
        argument and return types), and the class to which the method
        belongs. In fact, as emphasized by the boxed value, the
        <tt>Methodref</tt> constant itself just refers to other entries
        holding the real data, e.g., it refers to a <tt>ConstantClass</tt>
        entry containing a symbolic reference to the class
        <tt>java.io.PrintStream</tt>. To keep the class file compact,
        such constants are typically shared by different instructions
        and other constant pool entries. Similarly, a field is
        represented by a <tt>Fieldref</tt> constant that includes
        information about the name, the type and the containing class of
        the field.
      </p>
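
      <p>
        The chain of index references described above can be modelled
        with a small, simplified constant pool. The following sketch is
        not the <font face="helvetica,arial">BCEL</font> API: all class
        and method names (<tt>ConstantPoolSketch</tt>,
        <tt>describeMethod</tt>, and so on) are hypothetical, and the
        index values are invented for the example. The
        <tt>NameAndType</tt> entry is taken from the JVM specification,
        where a method reference points to a class entry and to a
        name-and-type entry, which in turn point to plain strings.
      </p>

      <source><![CDATA[
```java
/** Hypothetical, simplified model of a constant pool (illustration only). */
final class ConstantPoolSketch {
    static final class Utf8 {
        final String text;
        Utf8(String text) { this.text = text; }
    }

    static final class ClassRef {
        final int nameIndex;
        ClassRef(int nameIndex) { this.nameIndex = nameIndex; }
    }

    static final class NameAndType {
        final int nameIndex;
        final int descriptorIndex;
        NameAndType(int nameIndex, int descriptorIndex) {
            this.nameIndex = nameIndex;
            this.descriptorIndex = descriptorIndex;
        }
    }

    static final class Methodref {
        final int classIndex;
        final int nameAndTypeIndex;
        Methodref(int classIndex, int nameAndTypeIndex) {
            this.classIndex = classIndex;
            this.nameAndTypeIndex = nameAndTypeIndex;
        }
    }

    private final Object[] entries;   // index 0 stays unused, as in real class files

    ConstantPoolSketch(Object... entries) { this.entries = entries; }

    private String utf8(int index) { return ((Utf8) entries[index]).text; }

    /** Follows the index references of a Methodref entry down to the strings. */
    String describeMethod(int index) {
        Methodref method = (Methodref) entries[index];
        ClassRef owner = (ClassRef) entries[method.classIndex];
        NameAndType nameAndType = (NameAndType) entries[method.nameAndTypeIndex];
        return utf8(owner.nameIndex).replace('/', '.') + "."
                + utf8(nameAndType.nameIndex) + " " + utf8(nameAndType.descriptorIndex);
    }

    /** Invented pool contents for the println call discussed in the text. */
    static ConstantPoolSketch example() {
        return new ConstantPoolSketch(
                null,                                // 0: unused
                new Methodref(2, 4),                 // 1: class at 2, name and type at 4
                new ClassRef(3),                     // 2: class name at 3
                new Utf8("java/io/PrintStream"),     // 3
                new NameAndType(5, 6),               // 4: name at 5, signature at 6
                new Utf8("println"),                 // 5
                new Utf8("(Ljava/lang/String;)V"));  // 6
    }

    public static void main(String[] args) {
        System.out.println(example().describeMethod(1));
    }
}
```
]]></source>

      <p>
        Because every entry is addressed by its index, the string at
        index 3 in this sketch could be shared by any number of other
        entries that refer to the same class.
      </p>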

      <p>
        The constant pool basically holds the following types of
        constants: references to methods, fields and classes, strings,
        integers, floats, longs, and doubles.
      </p>

    </subsection>

    <subsection name="Byte code instruction set">
      <p>
        The JVM is a stack-oriented interpreter that creates a local stack
        frame of fixed size for every method invocation. The size of the
        local stack has to be computed by the compiler. Values may also be
        stored intermediately in a frame area containing <em>local
        variables</em>, which can be used like a set of registers. These
        local variables are addressed with 16-bit indexes starting at 0;
        since their number is recorded in a 16-bit field of the method's
        <tt>Code</tt> attribute, a method can have at most 65535 of them.
        The stack frames of caller and callee method overlap, i.e., the
        caller pushes arguments onto the operand stack and the called
        method receives them in local variables.
      </p>

      <p>
        The byte code instruction set currently consists of 212
        instructions; 44 opcodes are marked as reserved and may be used
        for future extensions or intermediate optimizations within the
        Virtual Machine. The instruction set can be roughly grouped as
        follows:
      </p>

      <p>
        <b>Stack operations:</b> Constants can be pushed onto the stack
        either by loading them from the constant pool with the
        <tt>ldc</tt> instruction or with special "short-cut"
        instructions where the operand is encoded into the instructions,
        e.g., <tt>iconst_0</tt> or <tt>bipush</tt> (push byte value).
      </p>
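
      <p>
        To make the choice between these instructions concrete, the
        following sketch picks the shortest push instruction for a given
        <tt>int</tt> constant. It follows the value ranges defined in the
        JVM specification (<tt>iconst_m1</tt> to <tt>iconst_5</tt> for -1
        to 5, <tt>bipush</tt> for byte values, <tt>sipush</tt> for short
        values, <tt>ldc</tt> otherwise). The class
        <tt>PushSelector</tt> is a hypothetical name, and the textual
        form of the result is chosen for readability only; a compiler
        would of course emit opcodes rather than strings.
      </p>

      <source><![CDATA[
```java
/** Hypothetical helper for illustration: names the shortest push instruction for an int. */
final class PushSelector {
    static String pushInstruction(int value) {
        if (value >= -1 && value <= 5) {
            // One-byte instructions with the operand encoded in the opcode itself
            return value == -1 ? "iconst_m1" : "iconst_" + value;
        }
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
            return "bipush " + value;    // opcode followed by one operand byte
        }
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            return "sipush " + value;    // opcode followed by two operand bytes
        }
        return "ldc " + value;           // constant is stored in the constant pool
    }

    public static void main(String[] args) {
        int[] samples = { 0, 100, 4711, 100000 };
        for (int sample : samples) {
            System.out.println(pushInstruction(sample));
        }
    }
}
```
]]></source>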

      <p>
        <b>Arithmetic operations:</b> The instruction set of the Java
        Virtual Machine distinguishes its operand types using different
        instructions to operate on values of specific type. Arithmetic
        operations starting with <tt>i</tt>, for example, denote an
        integer operation, e.g., <tt>iadd</tt>, which adds two integers
        and pushes the result back onto the stack. The Java types
        <tt>boolean</tt>, <tt>byte</tt>, <tt>short</tt>, and
        <tt>char</tt> are handled as integers by the JVM.
      </p>

      <p>
        <b>Control flow:</b> There are branch instructions like
        <tt>goto</tt> and <tt>if_icmpeq</tt>, which compares two integers
        for equality. There is also a <tt>jsr</tt> (jump to sub-routine)
        and <tt>ret</tt> pair of instructions that was traditionally used
        to implement the <tt>finally</tt> clause of <tt>try-catch</tt>
        blocks; current compilers duplicate the <tt>finally</tt> code
        instead. Exceptions may be thrown with the <tt>athrow</tt>
        instruction. Branch targets are coded as offsets from the current
        byte code position, i.e., with an integer number.
      </p>

      <p>
        <b>Load and store operations</b> for local variables like
        <tt>iload</tt> and <tt>istore</tt>. There are also array
        operations like <tt>iastore</tt> which stores an integer value
        into an array.
      </p>

      <p>
        <b>Field access:</b> The value of an instance field may be
        retrieved with <tt>getfield</tt> and written with
        <tt>putfield</tt>. For static fields, there are
        <tt>getstatic</tt> and <tt>putstatic</tt> counterparts.
      </p>

      <p>
        <b>Method invocation:</b> Static methods are called via
        <tt>invokestatic</tt>, whereas instance methods are bound
        virtually with the <tt>invokevirtual</tt> instruction. Super
        class methods and private methods are invoked with
        <tt>invokespecial</tt>. Interface methods are a special case
        and are invoked with <tt>invokeinterface</tt>.
      </p>

      <p>
        <b>Object allocation:</b> Class instances are allocated with the
        <tt>new</tt> instruction, arrays of basic type like
        <tt>int[]</tt> with <tt>newarray</tt>, arrays of references like
        <tt>String[][]</tt> with <tt>anewarray</tt> or
        <tt>multianewarray</tt>.
      </p>

      <p>
        <b>Conversion and type checking:</b> For stack operands of basic
        type there exist casting operations like <tt>f2i</tt> which
        converts a float value into an integer. The validity of a type
        cast may be checked with <tt>checkcast</tt> and the
        <tt>instanceof</tt> operator can be directly mapped to the
        equally named instruction.
      </p>

      <p>
        Most instructions have a fixed length, but there are also some
        variable-length instructions: in particular, the
        <tt>lookupswitch</tt> and <tt>tableswitch</tt> instructions, which
        are used to implement <tt>switch()</tt> statements. Since the
        number of <tt>case</tt> clauses may vary, these instructions
        contain a variable number of branch targets.
      </p>
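
      <p>
        How the length of these two instructions varies can be computed
        directly from the layout given in the JVM specification: the
        opcode is followed by zero to three padding bytes so that the
        operands start at an offset that is a multiple of four, then by
        4-byte operands. The sketch below is an illustration under that
        assumption; the class <tt>SwitchLength</tt> and its method names
        are hypothetical, and <tt>pc</tt> denotes the offset of the
        opcode within the method's byte code.
      </p>

      <source><![CDATA[
```java
/** Hypothetical helper for illustration: byte lengths of the two switch instructions. */
final class SwitchLength {
    /** Padding bytes after an opcode at offset pc so that the operands are 4-byte aligned. */
    static int padding(int pc) {
        return (4 - ((pc + 1) % 4)) % 4;
    }

    /** tableswitch: opcode, padding, default/low/high, one jump offset per value in low..high. */
    static int tableswitchLength(int pc, int low, int high) {
        int targets = high - low + 1;
        return 1 + padding(pc) + 3 * 4 + targets * 4;
    }

    /** lookupswitch: opcode, padding, default/npairs, one (match, offset) pair per case. */
    static int lookupswitchLength(int pc, int pairs) {
        return 1 + padding(pc) + 2 * 4 + pairs * 8;
    }

    public static void main(String[] args) {
        // A switch over the dense values 0..2, placed at the start of a method
        System.out.println(tableswitchLength(0, 0, 2));
        // A switch over two sparse values, placed at offset 3
        System.out.println(lookupswitchLength(3, 2));
    }
}
```
]]></source>

      <p>
        Note that the length depends on the position of the instruction
        as well as on the number of cases, so moving a switch within a
        method can change its size.
      </p>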

      <p>
        We will not list all byte code instructions here, since these are
        explained in detail in the <a
              href="http://docs.oracle.com/javase/specs/">JVM
        specification</a>. The opcode names are mostly self-explanatory,
        so understanding the following code examples should be fairly
        intuitive.
      </p>

    </subsection>

    <subsection name="Method code">
      <p>
        Non-abstract (and non-native) methods contain an attribute
        "<tt>Code</tt>" that holds the following data: the maximum size of
        the method's operand stack, the number of local variables, and an
        array of byte code instructions. Optionally, it may also contain
        information about the names of local variables and source file
        line numbers that can be used by a debugger.
      </p>

      <p>
        Whenever an exception is raised during execution, the JVM performs
        exception handling by looking into a table of exception
        handlers. The table marks handlers, i.e., code chunks, as
        responsible for exceptions of certain types that are raised within
        a given area of the byte code. When there is no appropriate
        handler, the exception is propagated back to the caller of the
        method. The handler information is itself stored in a table
        contained within the <tt>Code</tt> attribute.
      </p>

    </subsection>

    <subsection name="Byte code offsets">
      <p>
        Targets of branch instructions like <tt>goto</tt> are encoded as
        relative offsets in the array of byte codes. Exception handlers
        and local variables refer to absolute addresses within the byte
        code. The former contain references to the start and the end of
        the <tt>try</tt> block, and to the handler code. The
        latter mark the range in which a local variable is valid, i.e.,
        its scope. This makes it difficult to insert or delete code areas
        on this level of abstraction, since one has to recompute the
        offsets every time and update the referring objects. We will see
        in <a href="bcel-api.html#ClassGen">section 3.3</a> how <font
              face="helvetica,arial">BCEL</font> remedies this restriction.
      </p>
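
      <p>
        The bookkeeping involved can be illustrated with a small sketch
        that recomputes a relative branch offset after some bytes have
        been inserted into the byte code. This is not how
        <font face="helvetica,arial">BCEL</font> works internally; the
        class <tt>OffsetPatcher</tt> and its methods are hypothetical,
        and the sketch assumes that a branch pointing exactly at the
        insertion position should keep pointing at the instruction that
        was originally located there.
      </p>

      <source><![CDATA[
```java
/** Hypothetical helper for illustration: fixes offsets after inserting bytes. */
final class OffsetPatcher {
    /** New absolute address of a position after insertedBytes were inserted at insertPos. */
    static int shift(int position, int insertPos, int insertedBytes) {
        return position >= insertPos ? position + insertedBytes : position;
    }

    /** New relative offset of a branch located at branchPos with the given old offset. */
    static int patchedOffset(int branchPos, int relativeOffset, int insertPos, int insertedBytes) {
        int target = branchPos + relativeOffset;              // old absolute target
        int newBranchPos = shift(branchPos, insertPos, insertedBytes);
        int newTarget = shift(target, insertPos, insertedBytes);
        return newTarget - newBranchPos;
    }

    public static void main(String[] args) {
        // A branch at offset 1 that targets offset 8 has the relative offset 7.
        // Inserting one byte at offset 4 moves the target but not the branch itself.
        System.out.println(patchedOffset(1, 7, 4, 1));
        // A branch at offset 5 targeting offset 16 moves together with its target.
        System.out.println(patchedOffset(5, 11, 4, 1));
    }
}
```
]]></source>

      <p>
        Absolute addresses, such as those stored for exception handlers
        and local variable ranges, only need the <tt>shift</tt> step, but
        every one of them has to be visited after each modification.
      </p>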

    </subsection>

    <subsection name="Type information">
      <p>
        Java is a type-safe language and the information about the types
        of fields, local variables, and methods is stored in so-called
        <em>signatures</em>. These are strings stored in the constant pool
        and encoded in a special format. For example, the argument and
        return types of the <tt>main</tt> method
      </p>

      <p align="center">
        <source>public static void main(String[] argv)</source>
      </p>

      <p>
        are represented by the signature
      </p>

      <p align="center">
        <source>([Ljava/lang/String;)V</source>
      </p>

      <p>
        Classes are internally represented by strings like
        <tt>"java/lang/String"</tt>, basic types like <tt>float</tt> by an
        integer number. Within signatures, basic types are represented by
        single characters, e.g., <tt>I</tt> for <tt>int</tt>, and classes
        by their name enclosed in <tt>L</tt> and <tt>;</tt>, e.g.,
        <tt>Ljava/lang/String;</tt>. Arrays are denoted with a
        <tt>[</tt> in front of the element type.
      </p>
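
      <p>
        The encoding rules can be written down as a short recursive
        function. The following sketch is an illustration only: the class
        <tt>Descriptors</tt> and its methods are hypothetical names. In
        <tt>main</tt>, its result is printed next to the string produced
        by the standard library method
        <tt>java.lang.invoke.MethodType.toMethodDescriptorString()</tt>,
        which implements the same encoding.
      </p>

      <source><![CDATA[
```java
import java.lang.invoke.MethodType;

/** Hypothetical helper for illustration: encodes Java types as signature strings. */
final class Descriptors {
    static String typeDescriptor(Class<?> type) {
        if (type.isArray()) {
            return "[" + typeDescriptor(type.getComponentType());  // one '[' per dimension
        }
        if (type == void.class)    return "V";
        if (type == boolean.class) return "Z";
        if (type == byte.class)    return "B";
        if (type == char.class)    return "C";
        if (type == short.class)   return "S";
        if (type == int.class)     return "I";
        if (type == long.class)    return "J";
        if (type == float.class)   return "F";
        if (type == double.class)  return "D";
        // Class names use '/' instead of '.' and are enclosed in 'L' and ';'
        return "L" + type.getName().replace('.', '/') + ";";
    }

    static String methodDescriptor(Class<?> returnType, Class<?>... parameterTypes) {
        StringBuilder result = new StringBuilder("(");
        for (Class<?> parameterType : parameterTypes) {
            result.append(typeDescriptor(parameterType));
        }
        return result.append(')').append(typeDescriptor(returnType)).toString();
    }

    public static void main(String[] args) {
        // Signature of: public static void main(String[] argv)
        System.out.println(methodDescriptor(void.class, String[].class));
        // The same signature as computed by the standard library
        System.out.println(
                MethodType.methodType(void.class, String[].class).toMethodDescriptorString());
    }
}
```
]]></source>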

    </subsection>

    <subsection name="Code example">
      <p>
        The following example program prompts for a number and prints
        its factorial. The <tt>readLine()</tt> method reading from the
        standard input may raise an <tt>IOException</tt>, and if a
        malformed number is passed to <tt>parseInt()</tt>, it throws a
        <tt>NumberFormatException</tt>. Thus, the critical area of code
        must be encapsulated in a <tt>try-catch</tt> block.
      </p>

      <source>
import java.io.*;

public class Factorial {
    // Reader for the standard input, shared by all calls of readInt()
    private static BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

    // Recursive definition: 0! = 1, n! = n * (n - 1)!
    public static int fac(int n) {
        return (n == 0) ? 1 : n * fac(n - 1);
    }

    // Prompts for a number; returns the default 4711 if reading or parsing fails
    public static int readInt() {
        int n = 4711;
        try {
            System.out.print("Please enter a number&gt; ");
            n = Integer.parseInt(in.readLine());
        } catch (IOException e1) {
            System.err.println(e1);
        } catch (NumberFormatException e2) {
            System.err.println(e2);
        }
        return n;
    }

    public static void main(String[] argv) {
        int n = readInt();
        System.out.println("Factorial of " + n + " is " + fac(n));
    }
}
      </source>

      <p>
        This code example typically compiles to the following chunks of
        byte code:
      </p>

      <source>
        0:  iload_0
        1:  ifne            #8
        4:  iconst_1
        5:  goto            #16
        8:  iload_0
        9:  iload_0
        10: iconst_1
        11: isub
        12: invokestatic    Factorial.fac (I)I (12)
        15: imul
        16: ireturn

        LocalVariable(start_pc = 0, length = 16, index = 0:int n)
      </source>

      <p><b>fac():</b>
        The method <tt>fac</tt> has only one local variable, the argument
        <tt>n</tt>, stored at index 0. This variable's scope ranges from
        the start of the byte code sequence to the very end. If the value
        of <tt>n</tt> (the value fetched with <tt>iload_0</tt>) is not
        equal to 0, the <tt>ifne</tt> instruction branches to the byte
        code at offset 8; otherwise, a 1 is pushed onto the operand stack
        and the control flow branches to the final return. For ease of
        reading, the offsets of the branch instructions, which are
        actually relative, are displayed as absolute addresses in these
        examples.
      </p>

      <p>
        If recursion has to continue, the arguments for the multiplication
        (<tt>n</tt> and <tt>fac(n - 1)</tt>) are evaluated and the results
        pushed onto the operand stack. After the multiplication operation
        has been performed, the function returns the computed value from
        the top of the stack.
      </p>
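
      <p>
        To follow the execution step by step, the byte code of
        <tt>fac</tt> listed above can be replayed by a tiny hand-written
        interpreter. The sketch below is an illustration only and is
        hard-wired to this one method: the class <tt>FacInterpreter</tt>
        is a hypothetical name, each <tt>case</tt> label corresponds to
        one byte code offset of the listing, and a recursive Java call
        stands in for the new stack frame that <tt>invokestatic</tt>
        creates.
      </p>

      <source><![CDATA[
```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical interpreter for illustration: replays the byte code of fac(int). */
final class FacInterpreter {
    static int fac(int argument) {
        int[] locals = { argument };                 // local variable 0 holds n
        Deque<Integer> stack = new ArrayDeque<>();   // operand stack of this frame
        int pc = 0;                                  // offset of the next instruction
        while (true) {
            switch (pc) {
                case 0:  stack.push(locals[0]); pc = 1; break;        // iload_0
                case 1:  pc = (stack.pop() != 0) ? 8 : 4; break;      // ifne #8
                case 4:  stack.push(1); pc = 5; break;                // iconst_1
                case 5:  pc = 16; break;                              // goto #16
                case 8:  stack.push(locals[0]); pc = 9; break;        // iload_0
                case 9:  stack.push(locals[0]); pc = 10; break;       // iload_0
                case 10: stack.push(1); pc = 11; break;               // iconst_1
                case 11: {                                            // isub
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left - right);
                    pc = 12;
                    break;
                }
                case 12:                                              // invokestatic fac
                    // The argument is popped here and becomes local 0 of the callee
                    stack.push(fac(stack.pop()));
                    pc = 15;
                    break;
                case 15: {                                            // imul
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left * right);
                    pc = 16;
                    break;
                }
                case 16: return stack.pop();                          // ireturn
                default: throw new IllegalStateException("No instruction at offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(fac(5));
    }
}
```
]]></source>

      <p>
        The operand stack never holds more than three values in this
        sketch, which illustrates why a compiler can determine the
        required stack size in advance.
      </p>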

      <source>
        0:  sipush        4711
        3:  istore_0
        4:  getstatic     java.lang.System.out Ljava/io/PrintStream;
        7:  ldc           "Please enter a number&gt; "
        9:  invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V
        12: getstatic     Factorial.in Ljava/io/BufferedReader;
        15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String;
        18: invokestatic  java.lang.Integer.parseInt (Ljava/lang/String;)I
        21: istore_0
        22: goto          #44
        25: astore_1
        26: getstatic     java.lang.System.err Ljava/io/PrintStream;
        29: aload_1
        30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
        33: goto          #44
        36: astore_1
        37: getstatic     java.lang.System.err Ljava/io/PrintStream;
        40: aload_1
        41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
        44: iload_0
        45: ireturn

        Exception handler(s) =
        From    To      Handler Type
        4       22      25      java.io.IOException(6)
        4       22      36      NumberFormatException(10)
      </source>

      <p><b>readInt():</b> First the local variable <tt>n</tt> (at index 0)
        is initialized to the value 4711. The next instruction,
        <tt>getstatic</tt>, loads the reference held by the static
        <tt>System.out</tt> field onto the stack. Then a string is loaded
        and printed, and a number is read from the standard input and
        assigned to <tt>n</tt>.
      </p>

      <p>
        If one of the called methods (<tt>readLine()</tt> and
        <tt>parseInt()</tt>) throws an exception, the Java Virtual Machine
        calls one of the declared exception handlers, depending on the
        type of the exception. The <tt>try</tt>-clause itself does not
        produce any code; it merely defines the range in which the
        subsequent handlers are active. In the example, the specified
        source code area maps to a byte code area ranging from offset 4
        (inclusive) to 22 (exclusive). If no exception has occurred
        ("normal" execution flow), the <tt>goto</tt> instruction at
        offset 22 branches past the handler code. There the value of
        <tt>n</tt> is loaded and returned.
      </p>
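
      <p>
        The selection of a handler can be sketched as a linear search
        through the exception table shown above. The fragment below is
        an illustration and not the code of any real Virtual Machine:
        the class <tt>ExceptionTable</tt> and its members are
        hypothetical names, and exception types are compared with Java
        reflection instead of constant pool references. A result of -1
        stands for the case in which no handler matches and the
        exception is propagated to the caller.
      </p>

      <source><![CDATA[
```java
import java.io.IOException;

/** Hypothetical model for illustration: handler lookup in an exception table. */
final class ExceptionTable {
    static final class Entry {
        final int from;      // start of the protected range (inclusive)
        final int to;        // end of the protected range (exclusive)
        final int handler;   // offset of the handler code
        final Class<? extends Throwable> type;

        Entry(int from, int to, int handler, Class<? extends Throwable> type) {
            this.from = from;
            this.to = to;
            this.handler = handler;
            this.type = type;
        }
    }

    /** The table of readInt() as printed in the listing above. */
    static Entry[] readIntTable() {
        return new Entry[] {
            new Entry(4, 22, 25, IOException.class),
            new Entry(4, 22, 36, NumberFormatException.class)
        };
    }

    /** Returns the offset of the first matching handler, or -1 if there is none. */
    static int findHandler(Entry[] table, int pc, Throwable thrown) {
        for (Entry entry : table) {
            boolean inRange = pc >= entry.from && pc < entry.to;
            if (inRange && entry.type.isInstance(thrown)) {
                return entry.handler;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Entry[] table = readIntTable();
        // readLine() is invoked at offset 15, parseInt() at offset 18
        System.out.println(findHandler(table, 15, new IOException("read failed")));
        System.out.println(findHandler(table, 18, new NumberFormatException("not a number")));
        System.out.println(findHandler(table, 18, new IllegalStateException("unrelated")));
    }
}
```
]]></source>

      <p>
        The entries are searched in table order and the first match
        wins, so the order of the <tt>catch</tt> clauses in the source
        code is preserved in the table.
      </p>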

      <p>
        The handler for <tt>java.io.IOException</tt> starts at
        offset 25. It simply prints the error and branches back to the
        normal execution flow, i.e., as if no exception had occurred.
      </p>

    </subsection>
    </section>
  </body>

</document>