<?xml version="1.0"?>
<!--
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
-->
<document>
  <properties>
    <title>The Java Virtual Machine</title>
  </properties>

  <body>
    <section name="The Java Virtual Machine">
      <p>
        Readers already familiar with the Java Virtual Machine and the
        Java class file format may want to skip this section and proceed
        with <a href="bcel-api.html">section 3</a>.
      </p>

      <p>
        Programs written in the Java language are compiled into a portable
        binary format called <em>byte code</em>. Every class is
        represented by a single class file containing class-related data
        and byte code instructions. These files are loaded dynamically
        into an interpreter (the <a
        href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine</a>, or JVM) and executed.
      </p>

      <p>
        <a href="#Figure 1">Figure 1</a> illustrates the procedure of
        compiling and executing a Java class: the source file
        (<tt>HelloWorld.java</tt>) is compiled into a Java class file
        (<tt>HelloWorld.class</tt>), loaded by the byte code interpreter,
        and executed.
        In order to implement additional features,
        researchers may want to transform class files (drawn with bold
        lines) before they are actually executed. This application area
        is one of the main topics of this article.
      </p>

      <p align="center">
        <a name="Figure 1">
          <img src="../images/jvm.gif"/>
          <br/>
          Figure 1: Compilation and execution of Java classes</a>
      </p>

      <p>
        Note that the general term "Java" in fact has two
        meanings: on the one hand, Java as a programming language, and on
        the other hand, the Java Virtual Machine, which is not
        targeted by the Java language exclusively, but may be used by <a
        href="http://www.robert-tolksdorf.de/vmlanguages.html">other
        languages</a> as well. We assume the reader is familiar with
        the Java language and has a general understanding of the
        Virtual Machine.
      </p>

      <subsection name="Java class file format">
      <p>
        Giving a full overview of the design issues of the Java class file
        format and the associated byte code instructions is beyond the
        scope of this paper. We will just give a brief introduction
        covering the details that are necessary for understanding the rest
        of this paper. The format of class files and the byte code
        instruction set are described in more detail in the <a
        href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine Specification</a>. In particular, we will not deal
        with the security constraints that the Java Virtual Machine has to
        check at run-time, i.e., the byte code verifier.
      </p>

      <p>
        <a href="#Figure 2">Figure 2</a> shows a simplified example of the
        contents of a Java class file: it starts with a header containing
        a "magic number" (<tt>0xCAFEBABE</tt>) and the version number,
        followed by the <em>constant pool</em>, which can be roughly
        thought of as the text segment of an executable, the <em>access
        rights</em> of the class encoded by a bit mask, a list of
        interfaces implemented by the class, lists containing the fields
        and methods of the class, and finally the <em>class
        attributes</em>, e.g., the <tt>SourceFile</tt> attribute giving
        the name of the source file. Attributes are a way of putting
        additional, user-defined information into class file data
        structures. For example, a custom class loader may evaluate such
        attribute data in order to perform its transformations. The JVM
        specification declares that unknown, i.e., user-defined, attributes
        must be ignored by any Virtual Machine implementation.
      </p>

      <p align="center">
        <a name="Figure 2">
          <img src="../images/classfile.gif"/>
          <br/>
          Figure 2: Java class file format</a>
      </p>

      <p>
        Because all of the information needed to dynamically resolve the
        symbolic references to classes, fields and methods at run-time is
        coded with string constants, the constant pool makes up
        the largest portion of an average class file, approximately
        60%. This makes the constant pool an easy target for code
        manipulation. The byte code instructions themselves make up
        only about 12%.
      </p>

      <p>
        The upper right box shows a "zoomed" excerpt of the constant pool,
        while the rounded box below depicts some instructions that are
        contained within a method of the example class.
        These
        instructions represent the straightforward translation of the
        well-known statement:
      </p>

      <p align="center">
        <source>System.out.println("Hello, world");</source>
      </p>

      <p>
        The first instruction loads the contents of the field <tt>out</tt>
        of class <tt>java.lang.System</tt> onto the operand stack. This is
        an instance of the class <tt>java.io.PrintStream</tt>. The
        <tt>ldc</tt> ("Load constant") instruction pushes a reference to
        the string "Hello, world" onto the stack. The next instruction
        invokes the instance method <tt>println</tt>, which takes both
        values as parameters (instance methods always implicitly take an
        instance reference as their first argument).
      </p>

      <p>
        Instructions, other data structures within the class file and
        constants themselves may refer to constants in the constant pool.
        Such references are implemented via fixed indexes encoded directly
        into the instructions. This is illustrated for some items of the
        figure, which are emphasized with a surrounding box.
      </p>

      <p>
        For example, the <tt>invokevirtual</tt> instruction refers to a
        <tt>MethodRef</tt> constant that contains information about the
        name of the called method, the signature (i.e., the encoded
        argument and return types), and the class to which the method
        belongs. In fact, as emphasized by the boxed value, the
        <tt>MethodRef</tt> constant itself just refers to other entries
        holding the real data, e.g., it refers to a <tt>ConstantClass</tt>
        entry containing a symbolic reference to the class
        <tt>java.io.PrintStream</tt>.
        To keep the class file compact, such constants are typically
        shared by different instructions and other constant pool
        entries. Similarly, a field is represented by a <tt>Fieldref</tt>
        constant that includes information about the name, the type and
        the containing class of the field.
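      </p>

      <p>
        The chain of references described above can be sketched in plain
        Java. This is a simplified model, not BCEL's API and not the real
        binary encoding: the entry classes, the <tt>samplePool()</tt>
        helper and the chosen indexes are hypothetical and only illustrate
        how a <tt>MethodRef</tt> entry is resolved through further constant
        pool entries. The intermediate <tt>NameAndType</tt> entry is not
        discussed in the text; it is the entry kind that the JVM
        specification uses to pair a name with a signature.
      </p>

      <source>
public class ConstantPoolSketch {
    // Hypothetical, simplified entry kinds. Real class files encode each
    // entry as a tag byte followed by raw data, not as Java objects.
    static final class Utf8 {
        final String text;
        Utf8(String text) { this.text = text; }
    }

    static final class ClassEntry {
        final int nameIndex;            // index of a Utf8 entry
        ClassEntry(int nameIndex) { this.nameIndex = nameIndex; }
    }

    static final class NameAndType {
        final int nameIndex;            // index of a Utf8 entry (method name)
        final int signatureIndex;       // index of a Utf8 entry (signature)
        NameAndType(int nameIndex, int signatureIndex) {
            this.nameIndex = nameIndex;
            this.signatureIndex = signatureIndex;
        }
    }

    static final class MethodRef {
        final int classIndex;           // index of a ClassEntry
        final int nameAndTypeIndex;     // index of a NameAndType entry
        MethodRef(int classIndex, int nameAndTypeIndex) {
            this.classIndex = classIndex;
            this.nameAndTypeIndex = nameAndTypeIndex;
        }
    }

    // Builds a tiny pool by hand; index 0 is left unused, as in class files.
    static Object[] samplePool() {
        Object[] pool = new Object[7];
        pool[1] = new MethodRef(2, 3);
        pool[2] = new ClassEntry(4);
        pool[3] = new NameAndType(5, 6);
        pool[4] = new Utf8("java/io/PrintStream");
        pool[5] = new Utf8("println");
        pool[6] = new Utf8("(Ljava/lang/String;)V");
        return pool;
    }

    // Follows the indexes of a MethodRef entry until only strings remain.
    static String resolve(Object[] pool, int index) {
        MethodRef method = (MethodRef) pool[index];
        ClassEntry owner = (ClassEntry) pool[method.classIndex];
        NameAndType nameAndType = (NameAndType) pool[method.nameAndTypeIndex];
        String className = ((Utf8) pool[owner.nameIndex]).text;
        String name = ((Utf8) pool[nameAndType.nameIndex]).text;
        String signature = ((Utf8) pool[nameAndType.signatureIndex]).text;
        return className + "." + name + " " + signature;
    }

    public static void main(String[] args) {
        System.out.println(resolve(samplePool(), 1));
    }
}
      </source>

      <p>
        Running <tt>main</tt> prints
        <tt>java/io/PrintStream.println (Ljava/lang/String;)V</tt>.
        Because entries refer to each other by index, a second
        <tt>MethodRef</tt> for another <tt>PrintStream</tt> method could
        reuse the class entry at index 2, which is the kind of sharing
        mentioned above.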
      </p>

      <p>
        The constant pool basically holds the following types of
        constants: references to methods, fields and classes, strings,
        integers, floats, longs, and doubles.
      </p>

      </subsection>

      <subsection name="Byte code instruction set">
      <p>
        The JVM is a stack-oriented interpreter that creates a local stack
        frame of fixed size for every method invocation. The size of the
        local stack has to be computed by the compiler. Values may also be
        stored intermediately in a frame area containing <em>local
        variables</em> which can be used like a set of registers. These
        local variables are numbered from 0 to 65535, i.e., you have a
        maximum of 65536 local variables per method. The stack frames
        of caller and callee method overlap, i.e., the caller
        pushes arguments onto the operand stack and the called method
        receives them in local variables.
      </p>

      <p>
        The byte code instruction set currently consists of 212
        instructions; 44 opcodes are marked as reserved and may be used
        for future extensions or intermediate optimizations within the
        Virtual Machine. The instruction set can be roughly grouped as
        follows:
      </p>

      <p>
        <b>Stack operations:</b> Constants can be pushed onto the stack
        either by loading them from the constant pool with the
        <tt>ldc</tt> instruction or with special "short-cut"
        instructions where the operand is encoded into the instruction,
        e.g., <tt>iconst_0</tt> or <tt>bipush</tt> (push byte value).
      </p>

      <p>
        <b>Arithmetic operations:</b> The instruction set of the Java
        Virtual Machine distinguishes its operand types by using different
        instructions to operate on values of a specific type. Arithmetic
        operations starting with <tt>i</tt>, for example, denote an
        integer operation. For example, <tt>iadd</tt> adds two integers
        and pushes the result back onto the stack.
        The Java types
        <tt>boolean</tt>, <tt>byte</tt>, <tt>short</tt>, and
        <tt>char</tt> are handled as integers by the JVM.
      </p>

      <p>
        <b>Control flow:</b> There are branch instructions like
        <tt>goto</tt> and <tt>if_icmpeq</tt>, which compares two integers
        for equality. There is also a <tt>jsr</tt> (jump to sub-routine)
        and <tt>ret</tt> pair of instructions that is used to implement
        the <tt>finally</tt> clause of <tt>try-catch</tt> blocks.
        Exceptions may be thrown with the <tt>athrow</tt> instruction.
        Branch targets are coded as offsets from the current byte code
        position, i.e., with an integer number.
      </p>

      <p>
        <b>Load and store operations:</b> Local variables are accessed
        with instructions like <tt>iload</tt> and <tt>istore</tt>. There
        are also array operations like <tt>iastore</tt>, which stores an
        integer value into an array.
      </p>

      <p>
        <b>Field access:</b> The value of an instance field may be
        retrieved with <tt>getfield</tt> and written with
        <tt>putfield</tt>. For static fields, there are
        <tt>getstatic</tt> and <tt>putstatic</tt> counterparts.
      </p>

      <p>
        <b>Method invocation:</b> Static methods are called via
        <tt>invokestatic</tt>, whereas instance methods are bound
        virtually with the <tt>invokevirtual</tt> instruction. Super class
        methods and private methods are invoked with
        <tt>invokespecial</tt>. A special case is interface methods, which
        are invoked with <tt>invokeinterface</tt>.
      </p>

      <p>
        <b>Object allocation:</b> Class instances are allocated with the
        <tt>new</tt> instruction, arrays of basic type like
        <tt>int[]</tt> with <tt>newarray</tt>, arrays of references like
        <tt>String[][]</tt> with <tt>anewarray</tt> or
        <tt>multianewarray</tt>.
      </p>

      <p>
        <b>Conversion and type checking:</b> For stack operands of basic
        type there exist casting operations like <tt>f2i</tt>, which
        converts a float value into an integer.
        The validity of a type
        cast may be checked with <tt>checkcast</tt>, and the
        <tt>instanceof</tt> operator can be directly mapped to the
        equally named instruction.
      </p>

      <p>
        Most instructions have a fixed length, but there are also some
        variable-length instructions, in particular the
        <tt>lookupswitch</tt> and <tt>tableswitch</tt> instructions, which
        are used to implement <tt>switch()</tt> statements. Since the
        number of <tt>case</tt> clauses may vary, these instructions
        contain a variable number of branch targets.
      </p>

      <p>
        We will not list all byte code instructions here, since these are
        explained in detail in the <a
        href="http://docs.oracle.com/javase/specs/">JVM
        specification</a>. The opcode names are mostly self-explanatory,
        so understanding the following code examples should be fairly
        intuitive.
      </p>

      </subsection>

      <subsection name="Method code">
      <p>
        Non-abstract (and non-native) methods contain an attribute
        "<tt>Code</tt>" that holds the following data: the maximum size of
        the method's operand stack, the number of local variables and an
        array of byte code instructions. Optionally, it may also contain
        information about the names of local variables and source file
        line numbers that can be used by a debugger.
      </p>

      <p>
        Whenever an exception is raised during execution, the JVM performs
        exception handling by looking into a table of exception
        handlers. The table marks handlers, i.e., code chunks, as
        responsible for exceptions of certain types that are raised within
        a given area of the byte code. When there is no appropriate
        handler, the exception is propagated back to the caller of the
        method. The handler information is itself stored in a table
        contained within the <tt>Code</tt> attribute.
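      </p>

      <p>
        The table lookup described above can be modelled in a few lines of
        Java. This is only a sketch under simplifying assumptions, not how
        any particular Virtual Machine implements it: the
        <tt>Handler</tt> and <tt>CatchType</tt> types, the
        <tt>sampleTable()</tt> helper and the offsets used are hypothetical
        names and sample values chosen for illustration, and the return
        value -1 stands for propagating the exception to the caller.
      </p>

      <source>
import java.io.IOException;

public class HandlerLookupSketch {
    // Hypothetical stand-in for the exception type of a table entry.
    interface CatchType {
        boolean matches(Throwable thrown);
    }

    // One table entry: byte code offsets startPc (inclusive) to endPc
    // (exclusive) are guarded by the handler code starting at handlerPc.
    static final class Handler {
        final int startPc;
        final int endPc;
        final int handlerPc;
        final CatchType type;

        Handler(int startPc, int endPc, int handlerPc, CatchType type) {
            this.startPc = startPc;
            this.endPc = endPc;
            this.handlerPc = handlerPc;
            this.type = type;
        }
    }

    // Sample values only; they do not come from a real class file.
    static Handler[] sampleTable() {
        return new Handler[] {
            new Handler(4, 22, 25, t -> t instanceof IOException),
            new Handler(4, 22, 36, t -> t instanceof NumberFormatException)
        };
    }

    // Returns the offset of the first matching handler, or -1 if the
    // exception has to be propagated to the caller of the method.
    static int findHandler(Handler[] table, int pc, Throwable thrown) {
        for (Handler handler : table) {
            if (handler.startPc > pc) continue;   // raised before the guarded area
            if (pc >= handler.endPc) continue;    // raised after the guarded area
            if (handler.type.matches(thrown)) return handler.handlerPc;
        }
        return -1;
    }

    public static void main(String[] args) {
        Handler[] table = sampleTable();
        System.out.println(findHandler(table, 15, new IOException()));
        System.out.println(findHandler(table, 18, new NumberFormatException()));
        System.out.println(findHandler(table, 18, new IllegalStateException()));
        System.out.println(findHandler(table, 44, new IOException()));
    }
}
      </source>

      <p>
        With the sample table, <tt>main</tt> prints 25, 36, -1 and -1: the
        first two exceptions are raised inside the guarded area and match
        an entry, the third has no entry of a matching type, and the
        fourth is raised outside the guarded area.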
      </p>

      </subsection>

      <subsection name="Byte code offsets">
      <p>
        Targets of branch instructions like <tt>goto</tt> are encoded as
        relative offsets in the array of byte codes. Exception handlers
        and local variables refer to absolute addresses within the byte
        code. The former contain references to the start and the end of
        the <tt>try</tt> block, and to the exception handler code. The
        latter mark the range in which a local variable is valid, i.e.,
        its scope. This makes it difficult to insert or delete code areas
        on this level of abstraction, since one has to recompute the
        offsets every time and update the referring objects. We will see
        in <a href="bcel-api.html#ClassGen">section 3.3</a> how <font
        face="helvetica,arial">BCEL</font> remedies this restriction.
      </p>

      </subsection>

      <subsection name="Type information">
      <p>
        Java is a type-safe language and the information about the types
        of fields, local variables, and methods is stored in so-called
        <em>signatures</em>. These are strings stored in the constant pool
        and encoded in a special format. For example, the argument and
        return types of the <tt>main</tt> method
      </p>

      <p align="center">
        <source>public static void main(String[] argv)</source>
      </p>

      <p>
        are represented by the signature
      </p>

      <p align="center">
        <source>([java/lang/String;)V</source>
      </p>

      <p>
        Classes are internally represented by strings like
        <tt>"java/lang/String"</tt>, basic types like <tt>float</tt> by an
        integer number. Within signatures they are represented by single
        characters, e.g., <tt>I</tt> for integer. Arrays are denoted with
        a <tt>[</tt> at the start of the signature.
      </p>

      </subsection>

      <subsection name="Code example">
      <p>
        The following example program prompts for a number and prints
        its factorial.
        The <tt>readLine()</tt> method, which reads from the
        standard input, may raise an <tt>IOException</tt>, and if a
        malformed number is passed to <tt>parseInt()</tt>, it throws a
        <tt>NumberFormatException</tt>. Thus, the critical area of code
        must be encapsulated in a <tt>try-catch</tt> block.
      </p>

      <source>
import java.io.*;

public class Factorial {
    private static BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

    public static int fac(int n) {
        return (n == 0) ? 1 : n * fac(n - 1);
    }

    public static int readInt() {
        int n = 4711;
        try {
            System.out.print("Please enter a number> ");
            n = Integer.parseInt(in.readLine());
        } catch (IOException e1) {
            System.err.println(e1);
        } catch (NumberFormatException e2) {
            System.err.println(e2);
        }
        return n;
    }

    public static void main(String[] argv) {
        int n = readInt();
        System.out.println("Factorial of " + n + " is " + fac(n));
    }
}
      </source>

      <p>
        This code example typically compiles to the following chunks of
        byte code:
      </p>

      <source>
        0:  iload_0
        1:  ifne            #8
        4:  iconst_1
        5:  goto            #16
        8:  iload_0
        9:  iload_0
        10: iconst_1
        11: isub
        12: invokestatic    Factorial.fac (I)I (12)
        15: imul
        16: ireturn

        LocalVariable(start_pc = 0, length = 16, index = 0:int n)
      </source>

      <p><b>fac():</b>
        The method <tt>fac</tt> has only one local variable, the argument
        <tt>n</tt>, stored at index 0. This variable's scope ranges from
        the start of the byte code sequence to the very end. If the value
        of <tt>n</tt> (the value fetched with <tt>iload_0</tt>) is not
        equal to 0, the <tt>ifne</tt> instruction branches to the byte
        code at offset 8; otherwise a 1 is pushed onto the operand stack
        and the control flow branches to the final return.
        For ease of
        reading, the offsets of the branch instructions, which are
        actually relative, are displayed as absolute addresses in these
        examples.
      </p>

      <p>
        If recursion has to continue, the arguments for the multiplication
        (<tt>n</tt> and <tt>fac(n - 1)</tt>) are evaluated and the results
        pushed onto the operand stack. After the multiplication operation
        has been performed, the function returns the computed value from
        the top of the stack.
      </p>

      <source>
        0:  sipush          4711
        3:  istore_0
        4:  getstatic       java.lang.System.out Ljava/io/PrintStream;
        7:  ldc             "Please enter a number> "
        9:  invokevirtual   java.io.PrintStream.print (Ljava/lang/String;)V
        12: getstatic       Factorial.in Ljava/io/BufferedReader;
        15: invokevirtual   java.io.BufferedReader.readLine ()Ljava/lang/String;
        18: invokestatic    java.lang.Integer.parseInt (Ljava/lang/String;)I
        21: istore_0
        22: goto            #44
        25: astore_1
        26: getstatic       java.lang.System.err Ljava/io/PrintStream;
        29: aload_1
        30: invokevirtual   java.io.PrintStream.println (Ljava/lang/Object;)V
        33: goto            #44
        36: astore_1
        37: getstatic       java.lang.System.err Ljava/io/PrintStream;
        40: aload_1
        41: invokevirtual   java.io.PrintStream.println (Ljava/lang/Object;)V
        44: iload_0
        45: ireturn

        Exception handler(s) =
        From    To      Handler Type
        4       22      25      java.io.IOException(6)
        4       22      36      NumberFormatException(10)
      </source>

      <p><b>readInt():</b> First the local variable <tt>n</tt> (at index 0)
        is initialized to the value 4711. The next instruction,
        <tt>getstatic</tt>, loads the reference held by the static
        <tt>System.out</tt> field onto the stack. Then a string is loaded
        and printed, and a number is read from the standard input and
        assigned to <tt>n</tt>.
      </p>

      <p>
        If one of the called methods (<tt>readLine()</tt> and
        <tt>parseInt()</tt>) throws an exception, the Java Virtual Machine
        transfers control to one of the declared exception handlers,
        depending on the type of the exception. The <tt>try</tt>-clause
        itself does not produce any code; it merely defines the range in
        which the subsequent handlers are active. In the example, the
        specified source code area maps to a byte code area ranging from
        offset 4 (inclusive) to 22 (exclusive). If no exception has
        occurred ("normal" execution flow), the <tt>goto</tt> instruction
        at offset 22 branches past the handler code. There the value of
        <tt>n</tt> is loaded and returned.
      </p>

      <p>
        The handler for <tt>java.io.IOException</tt> starts at
        offset 25. It simply prints the error and branches back to the
        normal execution flow, i.e., as if no exception had occurred.
      </p>

      </subsection>
    </section>
  </body>

</document>