ASM

ASM 6 Developer Guide

by Eric Bruneton (last update 04/05/2018)

  1. Introduction
  2. Installation
    1. Overview
    2. Building
    3. Running tests
    4. Creating a distribution
  3. Code review
    1. Code organization
    2. Main data structures
      1. Object model
      2. Collections
    3. Main algorithms
      1. ClassReader
      2. ClassWriter, AnnotationWriter, FieldWriter, MethodWriter and ModuleWriter
      3. Label
      4. toByteArray
    4. Jump instruction encoding algorithm
      1. Basic algorithm
      2. Impact on stack map frames
      3. An example
    5. Control and data flow analysis algorithm
      1. Basic data flow analysis algorithm
      2. Uninitialized types
      3. Exception handlers
      4. Dead code
      5. Subroutines
  4. Code optimizations
    1. Optimizing performance
    2. Optimizing code size
    3. An example

1 Introduction

This guide is primarily intended for ASM users who would like to contribute to the ASM code base, although it can be of interest to other users as well. It explains the organization of the code, the main data structures and the most complex algorithms. It also explains the strategies that have been used to optimize ASM's performance and to minimize its code size, and illustrates them with a concrete example.

2 Installation

This section explains how to compile the ASM source code, how to test it, and how to build a distribution, which is the first thing to know before contributing to this project. It assumes that you have already downloaded the ASM source code, either with git clone https://gitlab.ow2.org/asm/asm.git or by downloading a snapshot of the git repository from https://gitlab.ow2.org/asm/asm.

2.1 Overview

The ASM sources are organized in several directories, with one directory per distribution artifact (asm.jar, asm-util.jar, etc.), plus a few other directories for building, testing and benchmarking them.

The build.gradle file defines the targets that are used to build, test and package the ASM source code. These targets are presented in the following sections.

2.2 Building

To build everything, use gradle/gradlew build from the main directory. To build a single project, for instance asm-tree, use gradle/gradlew :asm-tree:build. This compiles the code, runs the unit tests, and performs other checks to make sure that the code is properly formatted, that it is binary backward compatible with previous ASM versions, and that it is fully covered by unit tests.

Note: if the source code formatting verification fails, you can fix this by reformatting it with gradle/gradlew googleJavaFormat.

2.3 Running tests

To test everything, use gradle/gradlew test from the main directory. To test a single project, for instance asm-util, use gradle/gradlew :asm-util:test.

Test coverage reports can be generated with gradle/gradlew jacocoTestReport. The result is available in the build/reports/jacoco directory of each project.

2.4 Creating a distribution

A binary distribution, containing the compiled artifacts with their Javadoc and sources jar, a POM and digital signatures, can be generated with gradle/gradlew uploadArchives. The result is generated in the /tmp/myRepo directory.

Note: the version number used in the files produced when creating a distribution is defined in the header part of the build.gradle file.

3 Code review

This section presents the ASM source code organization, the main data structures and the most complex algorithms, which is the next important thing to know, after the installation and building procedures, in order to contribute to this project.

3.1 Code organization

ASM is organized in several packages:

Relationships between the ASM packages

From an implementation point of view the core package is the most complex one. The tree, util and xml packages are very simple (they just convert classes from one high level representation to another one, which is much simpler than converting classes between their byte array form and a high level representation). The signature package is also quite simple (it consists of a parser and a pretty printer for a small grammar). In fact the only packages that are not completely trivial, besides the core package, are the commons and analysis packages. But the algorithms used in the analysis package are similar to the ones explained in section 3.5. This explains why the rest of this guide focuses on the core package only.

3.2 Main data structures

3.2.1 Object model

The core package is made of 28 classes and interfaces. If we exclude the Opcodes interface, the 5 abstract visitor classes (AnnotationVisitor, ClassVisitor, FieldVisitor, MethodVisitor and ModuleVisitor) and the 6 utility classes (ConstantDynamic, Constants, Handle, Type, TypePath and TypeReference), this leaves only 16 classes, which are depicted in the figure below.

UML Class Diagram of the main classes of the ASM package

The conversion of compiled classes to visit events is done by only one class, namely the ClassReader class, which uses the auxiliary class Context. The inverse process is done by the other 14 classes, which are organized around the ClassWriter class.

3.2.2 Collections

The core package does not use any java.util class. Instead, lists, sets and graphs are encoded in dedicated fields of their elements.
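For example, a list of labels can be encoded with a "next" field in the Label class itself, instead of a java.util.List<Label>. Here is a minimal sketch of this pattern (the field and class names are illustrative, not the exact ASM ones):

class Label {
    // Next label in the list; the head of the list is stored in the
    // object that owns the list (no wrapper node object is needed).
    Label nextLabel;
}

class MethodContext {
    // Head of the intrusive singly linked list of labels.
    Label firstLabel;

    void addLabel(Label label) {
        // Insert at the head: no auxiliary object is allocated.
        label.nextLabel = firstLabel;
        firstLabel = label;
    }
}

Compared to java.util.LinkedList, this pattern avoids one wrapper object per element, which reduces memory usage and garbage collection pressure.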

3.3 Main algorithms

3.3.1 ClassReader

The ClassReader algorithm is quite straightforward. It is summarized below.

Some points are interesting to note:

3.3.2 ClassWriter, AnnotationWriter, FieldWriter, MethodWriter and ModuleWriter

Since the visit of the class members can be interleaved (it is possible to start visiting a field, then start visiting a method, go back to visit annotations of the field, continue with some instructions of the method, visit attributes of the field, add new instructions to the method, and so on), it is not possible to construct the class file's byte array in a sequential order, from beginning to end. Instead it is necessary to use several byte vectors that can grow simultaneously. This is why there are several writer classes, unlike for the reader case.

The ClassWriter class is the main entry point. It contains the class header elements and the lists of its fields and methods, as well as a SymbolTable instance, containing the constant pool items and the bootstrap methods of the class. This symbol table uses a hash set of Symbol objects, in order to avoid adding the same item several times in the constant pool or in the bootstrap methods array. The symbol table can be created from an existing class by passing a ClassReader argument to its constructor. This allows unchanged methods to be copied as is from a class reader to a class writer, without visiting their content (see section 3.3.1).
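The deduplication performed by the symbol table can be sketched as follows (a simplified illustration with hypothetical names; the real SymbolTable uses a custom hash table of Symbol entries, with several value fields per symbol):

class Symbol {
    final int tag;       // constant pool tag (CONSTANT_Utf8, CONSTANT_Class, ...)
    final String value;  // simplified: real symbols have several value fields
    final int index;     // index of this entry in the constant pool
    Symbol next;         // next symbol in the same hash bucket

    Symbol(int tag, String value, int index) {
        this.tag = tag;
        this.value = value;
        this.index = index;
    }
}

class SymbolTableSketch {
    private final Symbol[] buckets = new Symbol[256];
    private int constantPoolCount = 1;  // entry 0 is unused, as in class files

    // Returns the constant pool index of (tag, value), adding it if needed.
    int addConstant(int tag, String value) {
        int bucket = ((tag + value.hashCode()) & 0x7FFFFFFF) % buckets.length;
        for (Symbol s = buckets[bucket]; s != null; s = s.next) {
            if (s.tag == tag && s.value.equals(value)) {
                return s.index;  // already present: reuse its index
            }
        }
        Symbol symbol = new Symbol(tag, value, constantPoolCount++);
        symbol.next = buckets[bucket];
        buckets[bucket] = symbol;
        return symbol.index;
    }
}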

The AnnotationWriter, FieldWriter and ModuleWriter classes are quite simple: they convert visit events to a byte array representation, by using the class writer's SymbolTable in order to add constant pool items when necessary. The most complex writer class is the MethodWriter class, because it manages advanced features such as the automatic computation of the maximum stack size and number of local variables of a method, the automatic computation of its stack map frames, as well as the automatic management of short versus long jump instructions. Without these features, each visitXxxInsn method would be very simple, i.e. it would just add the bytecode representation of an instruction to the code byte vector.

Instead, in order to be able to automatically compute the maximum stack size, the maximum number of local variables and the stack map frames, each visitXxxInsn method does the following:

The maximum stack size or the stack map frames are actually computed in the computeMaxStackAndLocal and computeAllFrames methods, by using the control flow graph constructed in the visitXxxInsn methods (see section 3.5 for more details).
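For instance, the part of this bookkeeping that deals with the stack size can be sketched as follows (hypothetical names; the real code also updates the output frame and the control flow graph):

class BasicBlockSketch {
    int outputStackSize;  // stack size at the end of the block, relative to its start
    int outputStackMax;   // maximum relative stack size reached inside the block

    // Called from each visitXxxInsn method with the stack size delta of the
    // visited instruction, e.g. +1 for ILOAD, -1 for IADD, -2 for POP2.
    void updateStackSize(int sizeDelta) {
        outputStackSize += sizeDelta;
        if (outputStackSize > outputStackMax) {
            outputStackMax = outputStackSize;
        }
    }
}

The absolute maximum stack size of the method is then obtained, roughly, by propagating absolute stack sizes along the edges of the control flow graph and taking the maximum, over all blocks, of the stack size at the start of the block plus outputStackMax.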

3.3.3 Label

From a user point of view the Label class is used to reference instructions. Internally it is used to store the bytecode offset of an instruction (i.e. the bytecode array index of the first byte of the instruction), and to compute relative bytecode offsets (i.e. the difference between the bytecode offsets of two instructions). It is also used to represent basic blocks, which are used for the automatic computation of the maximum stack size and of the stack map frames of a method (see section 3.5).

Jump instructions such as IFEQ or GOTO are stored in bytecode as an opcode followed by a relative bytecode offset to the target instruction (this relative offset is the difference between the bytecode offset of the target instruction and the bytecode offset of the jump instruction). This relative offset can be computed easily in the case of a backward jump (i.e. a jump to an instruction that is before the jump instruction in the bytecode), but it cannot be computed at all in the case of a forward jump (i.e. a jump to an instruction that is after the jump instruction) since, in this case, the target instruction has not been visited yet, and so its bytecode offset is unknown. The case of forward jumps is solved in the following way:
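In essence, each Label records the positions of the forward references to it; when the label is finally visited in visitLabel, its bytecode offset becomes known and the reserved placeholder bytes are patched. A minimal sketch of this mechanism (simplified, with hypothetical names; the real Label class packs this information differently):

class LabelSketch {
    // Bytecode offsets of the jump opcodes that reference this label.
    private int[] sources = new int[4];
    private int sourceCount;
    int bytecodeOffset = -1;  // unknown until the label is visited

    // A forward jump to this label was written at 'opcodeOffset'; the two
    // bytes following the opcode were reserved for the (unknown) offset.
    void addForwardReference(int opcodeOffset) {
        if (sourceCount == sources.length) {
            sources = java.util.Arrays.copyOf(sources, 2 * sources.length);
        }
        sources[sourceCount++] = opcodeOffset;
    }

    // The label is now visited: its offset is known, patch all references.
    void resolve(byte[] code, int offset) {
        bytecodeOffset = offset;
        for (int i = 0; i < sourceCount; ++i) {
            int opcodeOffset = sources[i];
            int relativeOffset = offset - opcodeOffset;
            // Patch the two reserved bytes (big endian, as in class files).
            code[opcodeOffset + 1] = (byte) (relativeOffset >>> 8);
            code[opcodeOffset + 2] = (byte) relativeOffset;
        }
    }
}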

3.3.4 toByteArray

The toByteArray method in ClassWriter puts together all the pieces constructed in the various writer classes in order to get the full byte representation of the class. This is done in two steps:
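In essence (as a sketch of the pattern, not the exact code), the first step computes the exact size of the class file, so that the result array can be allocated exactly once, and the second step copies each piece at its final position. The general shape of this pattern, on a deliberately simplified example with hypothetical names and only three sections:

static byte[] assemble(byte[] constantPool, byte[] fields, byte[] methods) {
    // Step 1: compute the exact final size, so that the result array is
    // allocated only once, with no resizing.
    int size = constantPool.length + fields.length + methods.length;

    // Step 2: copy each piece at its final position, in the order
    // required by the class file format.
    byte[] classFile = new byte[size];
    int offset = 0;
    System.arraycopy(constantPool, 0, classFile, offset, constantPool.length);
    offset += constantPool.length;
    System.arraycopy(fields, 0, classFile, offset, fields.length);
    offset += fields.length;
    System.arraycopy(methods, 0, classFile, offset, methods.length);
    return classFile;
}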

3.4 Jump instruction encoding algorithm

Jump instructions such as IFEQ and GOTO store relative bytecode offsets as signed values encoded in two bytes. These offsets can therefore vary between -32768 and 32767. However, the bytecode of a method can be up to 65535 bytes long: it is therefore possible to have relative bytecode offsets that cannot be represented as signed, two-byte values. Fortunately there are two special jump instructions that store their relative bytecode offset as a signed, four-byte value, namely GOTO_W and JSR_W.

In the case of backward jumps the jump offset is known when the jump instruction is visited. It is then easy to use GOTO or GOTO_W, depending on the value of this relative offset. But in the case of forward jumps this is not possible. A first solution is to assume that the offset will require 4 bytes, and to always use GOTO_W for forward jumps, but this is not optimal. A second solution is to assume that the offset will require 2 bytes, and to use a normal jump instruction, with 2 bytes reserved to store the offset when it becomes known. The problem is that if the offset turns out to require 4 bytes, the jump instruction must be replaced with another one using a 4-byte offset, which can invalidate already computed offsets used in other jump instructions. ASM uses the second solution, which requires a method to replace the forward jump instructions that turn out to require long offsets. The algorithm used for this is presented in the rest of this section.

3.4.1 Basic algorithm

The algorithm uses the following properties:

Given these properties, if it turns out, in visitLabel, that a forward offset is greater than 32767, the opcode of the corresponding instruction is changed to the equivalent non standard opcode with an unsigned offset, and the offset is stored as an unsigned short.

Then, when all the class has been visited and converted to a byte array, if some non standard opcodes have been used, they must be replaced with standard ones. For this:

This process removes the existing ASM specific instructions, but can introduce new ones. Indeed, because the ASM specific instructions are replaced with longer sequences of standard instructions, some relative offsets between existing instructions, which were just below the 32767 limit, can become larger than this limit. In this case the process is repeated: the byte array containing the new ASM specific instructions is parsed again with a new ClassReader chained to the ClassWriter (whose content has been cleared beforehand). Eventually this process converges, and the result is a class containing only standard instruction opcodes.
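The overall iteration can be sketched as follows (all names are hypothetical; in the real code this logic is part of ClassWriter.toByteArray):

abstract class JumpExpansionSketch {
    abstract byte[] buildClassFile();                  // serialize the current content
    abstract boolean hasAsmSpecificOpcodes(byte[] b);  // e.g. ASM_GOTO
    abstract void rebuildFrom(byte[] classFile);       // clear, then re-read with a ClassReader

    final byte[] toByteArray() {
        byte[] classFile = buildClassFile();
        while (hasAsmSpecificOpcodes(classFile)) {
            // Re-reading expands each ASM specific instruction into a
            // standard sequence using GOTO_W; since GOTO_W is never shrunk
            // back to GOTO, the loop is guaranteed to terminate.
            rebuildFrom(classFile);
            classFile = buildClassFile();
        }
        return classFile;
    }
}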

3.4.2 Impact on stack map frames

An IFXxx instruction with a forward offset larger than 32767 must be replaced with an IFNotXxx GOTO_W sequence (the opposite conditional jump over a GOTO_W), which requires a stack map frame after the GOTO_W (for classes targeting Java 7 or higher). This stack map frame may not already be present, in which case it needs to be computed and inserted. One solution would be to recompute all the stack map frames, but this is not always possible (computing the stack map frames may require access to the class hierarchy, may load classes, etc.) and would not be efficient. Instead, we use the existing stack map frames to compute the one we need to insert. Indeed, by definition, the existing stack map frames are sufficient to compute the state of the local variables and of the stack at each instruction, with a linear algorithm that parses the instructions and the frames from beginning to end.

This is implemented as follows:

3.4.3 An example

Consider the following Java code:

public void m(int i, int j) {
    for (; cond(i); --i) {
        if (j == 0) {
            break;
        }
        ...
    }
}

It is compiled into:

public m(II)V
  GOTO L1
 L2
  ILOAD 2
  IFNE L3
  GOTO L4
 L3
  ...
  IINC 1 -1
 L1
  ALOAD 0
  ILOAD 1
  INVOKEVIRTUAL C.cond(I)Z
  IFNE L2
 L4
  RETURN

During the visit of each instruction, offsets are computed step by step as follows:

Offset Instruction Comment
0  GOTO L1 relative offset unknown when this instruction is visited
3 L2
3  ILOAD 2
4  IFNE L3 relative offset unknown when this instruction is visited
7  GOTO L4 relative offset unknown when this instruction is visited
10 L3 relative offset for IFNE L3 becomes known: 10 - 4 = 6
10  ...
32764  IINC 1 -1
32767 L1 relative offset for GOTO L1 becomes known: 32767 - 0 = 32767
32767  ALOAD 0
32768  ILOAD 1
32769  INVOKEVIRTUAL
32772  IFNE L2 relative offset = 3 - 32772 = -32769
L4
 RETURN

The offset computed for IFNE L2 is -32769, which cannot be stored in a short. The instruction is therefore transformed on the fly into the sequence IFEQ L4 GOTO_W L2:

0  GOTO L1 relative offset = 32767, changed during visit of L1
3 L2
3  ILOAD 2
4  IFNE L3 relative offset = 6, changed during visit of L3
7  GOTO L4 relative offset still unknown
10 L3
10  ...
32764  IINC 1 -1
32767 L1
32767  ALOAD 0
32768  ILOAD 1
32769  INVOKEVIRTUAL
32772  IFEQ L4 relative offset = 8
32775  GOTO_W L2 relative offset = 3 - 32775 = -32772
32780 L4 relative offset for GOTO L4 becomes known: 32780 - 7 = 32773
 RETURN

When L4 is reached the offset computed for GOTO L4 becomes known: we get 32773, which cannot be stored in a signed short. As explained in section 3.4.1, the GOTO instruction is then replaced with a non standard ASM_GOTO instruction, whose offset is stored as an unsigned short:

0  GOTO L1 relative offset = 32767, changed during visit of L1
3 L2
3  ILOAD 2
4  IFNE L3 relative offset = 6, changed during visit of L3
7  ASM_GOTO L4 relative offset = 32773, changed during visit of L4
10 L3
10  ...
32764  IINC 1 -1
32767 L1
32767  ALOAD 0
32768  ILOAD 1
32769  INVOKEVIRTUAL
32772  IFEQ L4 relative offset = 8
32775  GOTO_W L2 relative offset = 3 - 32775 = -32772
32780 L4
32780  RETURN

Since the bytecode contains at least one non standard instruction, the ClassWriter content is cleared, just after the byte array representation of the class has been computed (in toByteArray), and reconstructed again by parsing the byte array with a ClassReader chained to this ClassWriter. During this process the ClassReader converts the ASM_GOTO into a GOTO_W (and preserves the existing GOTO_W). When L1 is reached, we get:

0  GOTO L1 relative offset unknown
3 L2
3  ILOAD 2
4  IFNE L3 relative offset = 8, changed during visit of L3
7  GOTO_W L4 relative offset unknown
12 L3
12  ...
32766  IINC 1 -1
32769 L1
 ALOAD 0
 ILOAD 1
 INVOKEVIRTUAL
 IFEQ L4 relative offset = 8
 GOTO_W L2
L4
 RETURN

At this point the relative offset for GOTO L1 becomes known, namely 32769. This is too large for a signed short offset, so GOTO L1 is replaced with ASM_GOTO L1 with an unsigned short offset. The class parsing and reconstruction then continues, and eventually a new byte array for the class is produced. Since it still contains a non standard opcode, the same process is repeated: the ClassWriter content is cleared and reconstructed by parsing the byte array with a ClassReader. Thanks to the special EXPAND_ASM_INSNS flag, GOTO_W and JSR_W are never converted back to GOTO or JSR, which ensures that this iterative process will eventually converge. In our example, the second iteration does not produce any new non standard opcode, and we get the final result:

0  GOTO_W L1 relative offset = 32771
5 L2
5  ILOAD 2
6  IFNE L3 relative offset = 8
9  GOTO_W L4 relative offset = 32775
14 L3
14  ...
32768  IINC 1 -1
32771 L1
32771  ALOAD 0
32772  ILOAD 1
32773  INVOKEVIRTUAL
32776  IFEQ L4 relative offset = 8
32779  GOTO_W L2 relative offset = -32774
32784 L4
32784  RETURN

3.5 Control and data flow analysis algorithm

This section presents the algorithm used to compute the maximum stack size and the stack map frames of a method. This algorithm is a control and data flow analysis algorithm, based on the decomposition of the method into a control flow graph of basic blocks. A basic block is a sequence of instructions where only the first instruction is the target of a jump instruction, and where only the last instruction can jump to other basic blocks. The control flow graph of a method is the graph whose nodes are the basic blocks, and whose edges connect the basic blocks that are linked by jump instructions. This graph is constructed during the visit of each instruction. As an example, the control flow graph of the method defined in section 3.5.1 is the one shown in section 2.

In the following we explain the algorithm used to compute the stack map frames, implemented in the computeAllFrames method. The algorithm to compute the maximum stack size, implemented in the computeMaxStackAndLocal method, is similar but simpler (since only the size of the stack is needed), so it is not detailed here (except for the handling of subroutines).

3.5.1 Basic data flow analysis algorithm

Stack map frames are computed in a two-step process: the output frame of each basic block is first computed during the visit of instructions, and the input frame of each basic block is then computed with a fix point algorithm over the control flow graph.

Let's take a simple example in order to explain the details of this algorithm. Consider the following very simple method (where B is a subclass of A):

public static m(Z)LA;
  GETSTATIC B.VALUE : LB;
  ASTORE 1
  GOTO L0
 L1
  GETSTATIC A.VALUE : LA;
  ASTORE 1
 L0
  ILOAD 0
  IFNE L1
  ALOAD 1
  ARETURN

First step

As stated above, during the first step of the algorithm, which takes place in each visitXxxInsn method in ClassWriter, the state of the output frame of each basic block is updated by simulating the action of the visited instruction. It is important to note that the algorithm works at the basic block level, and not at the instruction level. This means that input and output frames are associated to basic blocks, and not to individual instructions. In practice, they are stored in a Frame object that is associated to a Label object, which marks the beginning of basic blocks.

The effect of this first step for the above example method is illustrated on the table below:

Label Instruction Output frame Comment
GETSTATIC B.VALUE : LB; O1 = [??] [?B] getstatic pushes a value of type B on the stack
ASTORE 1 O1 = [?B] [?] astore consumes this value and stores it in local 1
GOTO L0 O1 = [?B] [?] goto does not change the frame
 
L1 GETSTATIC A.VALUE : LA; O2 = [??] [?A] each basic block starts with a new, unknown frame
ASTORE 1 O2 = [?A] [?] astore stores the value produced by getstatic in local 1
 
L0 ILOAD 0 O3 = [??] [?I] iload pushes the value of local 0, which is of type int
IFNE L1 O3 = [??] [?] ifne consumes this value
 
ALOAD 1 O4 = [??] [?L1] aload pushes the value of local 1, which is unknown
ARETURN O4 = [??] [?] areturn consumes this value

At the beginning, the output frame O1 of the first basic block is completely unknown. During the visit of the first instruction, the action of GETSTATIC is simulated: the result is that a new value of type B is pushed on the stack, on top of the previous values (although we know here that the stack is initially empty, we do not take this into account and proceed as if the stack could previously contain any number of values of any type, hence the [?B]). During the visit of the second instruction, the output frame O1 is updated to simulate the effect of ASTORE: the result is that the value previously pushed on the stack is popped and stored in local variable 1. The visit of the third instruction does not change the output frame O1, but changes the current basic block to null.

The visit of the L1 label makes L1 become the new current basic block. Like for the first basic block, the output frame O2 of this basic block is initially completely unknown. The visit of the instructions of this basic block is then similar to the visit of the previous instructions.

The visit of the L0 label makes L0 become the new current basic block. Here again we start with a completely unknown output frame O3 although, in this case, we could start from the value of O2 (since this basic block is a successor of the previous one). The ILOAD instruction loads the value of local variable 0, which is necessarily of type int (the whole algorithm is based on the assumption that the code is correct), and pushes it on the stack. The IFNE instruction consumes this value.

Another effect of simulating the action of the IFNE instruction is to create a new basic block, and to make it the new current basic block. This is why, although there is no label just after this instruction, the basic block changes. Here again, the output frame O4 of this basic block is initially completely unknown (although, as before and for the same reason, we could start from the value of O3). The ALOAD instruction loads the value of local variable 1, whose type is unknown since the frame is initially completely unknown. The only thing we know is that, after the execution of this instruction, the stack contains one additional value whose type is the type of local variable 1 before this instruction was executed (hence the [?L1]).
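Such "relative" types, like the [?L1] above, can be represented with a compact encoding. The sketch below illustrates the idea with an int split into a kind field and a value field (an illustration only; the real Frame class uses a similar, but richer, bit field encoding):

final class FrameTypeSketch {
    // Kind of the type, stored in the upper bits of the int.
    static final int KIND_CONSTANT = 0x01000000;  // a known type, e.g. "int" or "B"
    static final int KIND_LOCAL    = 0x02000000;  // the type of input local #value
    static final int KIND_STACK    = 0x03000000;  // the type of input stack slot #value

    static final int KIND_MASK  = 0xFF000000;
    static final int VALUE_MASK = 0x00FFFFFF;

    // "The type of input local 1, whatever it turns out to be" ([?L1] above):
    static final int TYPE_OF_INPUT_LOCAL_1 = KIND_LOCAL | 1;

    static int kind(int type) { return type & KIND_MASK; }
    static int value(int type) { return type & VALUE_MASK; }
}

With this encoding, resolving the output frame of a basic block against a concrete input frame is a simple substitution: each KIND_LOCAL or KIND_STACK type is replaced with the corresponding type of the input frame.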

Second step

During the second step of the algorithm, which takes place in the computeAllFrames method, the input frames of each basic block are computed with an iterative fix point algorithm, i.e. an algorithm to find a fix point x of a function f, defined by f(x) = x (if the x values define a complete lattice and if f is monotonic, the sequence defined by x_{n+1} = f(x_n) converges to the smallest fix point of f, according to Tarski's theorem; here x is the control flow graph with its input and output frames, f is a "merge" function, and the order relation is based on the subtyping relation of the verification type system). First the input frame I1 of the first basic block is computed from the method descriptor "public static m(Z)LA;", which gives I1 = [I] []. Then the first basic block is marked as "changed", and the fix point algorithm starts:

Iteration Changed Output frames Input frames Comment
0 B1 O1 = [?B] [?]
O2 = [?A] [?]
O3 = [??] [?]
O4 = [??] [?]
I1= [I-] []
I2 = [??] [?]
I3 = [??] [?]
I4 = [??] [?]
Initialization of input frame I1 from the method's descriptor,
B1 marked as "changed"
1 B3 I1= [I-] []
I2 = [??] [?]
I3 = [IB] []
I4 = [??] [?]
B1 marked as "unchanged",
merge of I1 and O1 into I3 (B3 is a successor of B1),
B3 marked as "changed"
2 B2, B4 I1= [I-] []
I2 = [IB][]
I3 = [IB] []
I4 = [IB] []
B3 marked as "unchanged",
merge of I3 and O3 into I2 (B2 is a successor of B3),
B2 marked as changed,
merge of I3 and O3 into I4 (B4 is a successor of B3),
B4 marked as changed
3 B4, B3 I1= [I-] []
I2 = [IB] []
I3 = [IA] []
I4 = [IB] []
B2 marked as "unchanged",
merge of I2 and O2 into I3 (B3 is a successor of B2),
B3 marked as changed.
4 B3 I1= [I-] []
I2 = [IB] []
I3 = [IA] []
I4 = [IB] []
B4 marked as "unchanged"
5 B2, B4 I1= [I-] []
I2 = [IA] []
I3 = [IA] []
I4 = [IA] []
B3 marked as "unchanged",
merge of I3 and O3 into I2 (B2 is a successor of B3),
B2 marked as changed,
merge of I3 and O3 into I4 (B4 is a successor of B3),
B4 marked as changed
6 B4 I1= [I-] []
I2 = [IA] []
I3 = [IA] []
I4 = [IA] []
B2 marked as "unchanged",
merge of I2 and O2 into I3 (B3 is a successor of B2),
B3 not marked as changed.
7 I1= [I-] []
I2 = [IA] []
I3 = [IA] []
I4 = [IA] []
B4 marked as "unchanged"
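The iterations shown in this table can be implemented with a classical worklist algorithm. A self-contained sketch (illustrative names; the real code lives in the computeAllFrames method):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

abstract class FixPointSketch {
    static final class Block {
        int[] inputFrame;    // refined at each merge, until the fix point
        int[] outputFrame;   // expressed relative to the input frame
        List<Block> successors = new ArrayList<>();
        boolean changed;
    }

    // Merges the frame at the end of 'block' into successor.inputFrame and
    // returns true if successor.inputFrame changed (i.e. a type was widened
    // to a common super type). Left abstract in this sketch.
    abstract boolean merge(Block block, Block successor);

    final void computeInputFrames(Block firstBlock, int[] initialFrame) {
        firstBlock.inputFrame = initialFrame;
        firstBlock.changed = true;
        ArrayDeque<Block> worklist = new ArrayDeque<>();
        worklist.add(firstBlock);
        while (!worklist.isEmpty()) {
            Block block = worklist.remove();
            block.changed = false;
            for (Block successor : block.successors) {
                if (merge(block, successor) && !successor.changed) {
                    successor.changed = true;
                    worklist.add(successor);
                }
            }
        }
    }
}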

3.5.2 Uninitialized types

The simulation of a NEW T instruction results in a special uninitialized type being pushed on the stack. This special type contains the offset of the NEW instruction that created it. When an INVOKESPECIAL instruction for a T constructor is simulated, all occurrences of this special type in the current stack map frame must be replaced with the normal T type.
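For example, here is the canonical bytecode sequence produced for a Java expression new T(), with the simulated stack shown on the right (U denotes the special uninitialized type created for the NEW instruction at offset 0):

0  NEW T                        stack = [U]
3  DUP                          stack = [U U]
4  INVOKESPECIAL T.<init> ()V   stack = [T]   (all occurrences of U become T)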

The basic block input and output frame data structures presented in the previous section are not sufficient to take uninitialized types into account. Indeed, when a constructor invocation is visited, the type of the target may be unknown, and many local variable and operand stack types may also be unknown. It is therefore impossible to replace all occurrences of the target type everywhere in the stack map frame.

For this reason an additional data structure is associated with each basic block, namely the list of types that are initialized in this basic block (these types can be relative to the unknown input stack map frame of the basic block). This data structure is constructed during the visit of instructions, and is used in computeAllFrames, when all the types are known.

3.5.3 Exception handlers

For all the instructions covered by an exception handler, the control flow can jump to the exception handler block. This means that, inside the region covered by an exception handler, and as a consequence of the definition of basic blocks, basic blocks are reduced to individual instructions. In this case the advantage of using an algorithm working at the basic block level is lost, since there are as many basic blocks as instructions.

Fortunately it is not necessary to really use one basic block per instruction inside regions covered by exception handlers. This is because not all the frames associated with these instructions have an impact on the input frame of the exception handler. Indeed this input frame only contains local variable types, and its stack is reduced to a single element that depends only on the type of exception caught by this handler. As a consequence only the frames associated with the instructions that affect local variables are important. In practice, this means that, inside regions covered by exception handlers, xSTORE instructions end the current basic block (and start a new one), like an IFEQ instruction for instance.

As an example, consider the following method:

public m(Ljava/lang/Integer;Ljava/lang/Float;)Ljava/lang/Number;
  TRYCATCHBLOCK L0 L1 L1 java/lang/Exception
  ACONST_NULL
  ASTORE 3
 L0
  ALOAD 1
  ASTORE 3
  ALOAD 2
  ASTORE 3
  ALOAD 3
  ARETURN
 L1
  ASTORE 4
  ALOAD 3
  ARETURN

Normally, due to the exception handler, each instruction between L0 and L1 should be a distinct basic block, which would give 6 basic blocks inside this region. In practice, thanks to the above optimization, only the ASTORE instructions change the current basic block, which gives 3 basic blocks (ALOAD 1 ASTORE 3, ALOAD 2 ASTORE 3 and ALOAD 3 ARETURN). Note that if the instructions between L0 and L1 were considered as a single basic block, the frame computed for L1 would be incorrect: it would indeed be the same as the output frame of the previous block, where local variable 3 is of type Float (whereas the correct value is the common super type of Integer and Float, i.e. Number).

Note: the edges of the control flow graph that connect basic blocks to exception handler blocks are not constructed in the visitTryCatchBlock method but in the computeAllFrames method. Indeed, since the visitTryCatchBlock method must be called before the labels passed as arguments to this method are visited, it is not possible to know in this method which labels belong to the range protected by this handler.

3.5.4 Dead code

The fix point algorithm used in the second step of the algorithm described in section 3.5.1 is limited to reachable code. Indeed, by definition, reachable code is code that can be reached from the initial basic block in the control flow graph, and the fix point algorithm explores precisely these blocks. As a consequence the algorithm does not compute the input frame of dead basic blocks.

Unfortunately the Java 6 split verifier requires a stack map frame for every basic block, even unreachable ones. The problem, as shown above, is that these frames are not computed and, even worse, cannot be computed. Indeed an unreachable basic block can contain illegal bytecode sequences, such as ISTORE 1 ALOAD 1 (more precisely this was possible with the JVM verifier prior to Java 6; but this is no longer possible with the new verifier).

The consequence of all this is that dead code must either be removed or replaced with a valid bytecode sequence whose stack map frame can easily be computed. The first solution was judged too complicated, so the second one was chosen.

A simple solution is to replace dead code with NOP instructions. In this case any stack map frame is valid for these blocks. The only problem is that execution can fall from the end of a dead code block to the end of the method or to the start of a reachable block. So either the stack map frame of the dead code block must be consistent with the frame of the next block, or the last instruction of the dead code block must be replaced not with a NOP but with an instruction without successor, such as RETURN, GOTO or ATHROW.

The first solution is too complicated; the second one is possible, but the fact that a dead code block can be as small as a single byte must be taken into account: there is not always enough room to replace it with NOP instructions followed by a GOTO, for example. xRETURN could be used (it is a single byte instruction), but its opcode would have to match the method's return type. The ATHROW instruction does not have this problem and is also a single byte instruction; it was therefore chosen to end dead code blocks.

Note that it is not necessary to insert instructions that push something on the stack before ATHROW: the stack map frame for this block can "declare" that, when the execution of this basic block begins, there is already an exception on the stack (there is no consistency problem with other frames since this block is unreachable; there is no problem with declared exceptions either, because the verifier does not check that ATHROW instructions are consistent with declared exceptions).

But declaring a stack map frame of the form [] [java/lang/Throwable] for dead basic blocks may cause a problem if such a block is in the range of an exception handler: the stack map frame for this handler may be inconsistent, with a non-empty list of local variable types (the stack part is always consistent since it always has the form [exception type] for an exception handler). To solve this, we remove the dead basic blocks from the exception handler ranges (which may remove a handler, decrease its range, or split its range into two sub-ranges).

In summary, the solution for dead code blocks is to replace these blocks with NOP ... NOP ATHROW, to declare a [] [java/lang/Throwable] stack map frame for these blocks, and to remove the dead code blocks from the exception handlers.

3.5.5 Subroutines

The JSR and RET instructions, used for subroutines, complicate the control flow and data flow analysis algorithms. Fortunately they must not be used in Java 6 classes, so they do not have any impact on the algorithm used to compute stack map frames. But they do complicate the algorithm used to compute the maximum stack size with the COMPUTE_MAXS option. More precisely, they do not impact the algorithm used to compute the maximum stack size from the control flow graph, but they complicate the construction of this graph. As for exception handlers, the edges of the control flow graph that correspond to subroutines are computed in computeMaxStackAndLocal.

The first step consists in finding, for each basic block, to which subroutine it "belongs". A basic block may be reached from several subroutines, but we say that it "belongs" to the "oldest" of these subroutines (with the convention that a subroutine calling another one is "older" than the callee). We compute the owner subroutine of each basic block as follows: the blocks that can be reached from the start of the method, without following a JSR or a RET, are marked as belonging to subroutine #1, and the targets of the JSR instructions found during this exploration are enqueued; the same exploration is then repeated from each enqueued JSR target, with a new subroutine number, until the queue is empty (blocks that are already marked are not marked again).

The second step consists in finding, for each RET instruction, to which basic blocks it can return. The possible basic blocks are those that follow JSR instructions. We therefore examine each JSR instruction in turn. For each such JSR instruction I we look at the basic blocks that belong to the called subroutine. When a RET instruction is found we add the block following I as a successor of the RET instruction (except if I and the RET instruction belong to the same subroutine, which can happen if a subroutine returns to its parent subroutine without a RET).

Let's take an example to illustrate this:

public m(Z)V
 L0
  JSR L2
 L1
  RETURN
 L2
  ASTORE 2
  JSR L4
 L3
  GOTO L5
 L4
  ASTORE 3
  ILOAD 1
  IFEQ L5
  RET 3
 L5
  RET 2

After all the instructions have been visited in MethodWriter, and just before computeMaxStackAndLocal is called, the control flow graph is the following (L1 and L3 are not real successors of L0 and L2: these edges are only used to keep track of JSR's return addresses, and are ignored during the control flow graph analysis used to compute the maximum stack size):

As explained above, the first step in computeMaxStackAndLocal consists in finding, for each basic block, to which subroutine it belongs. The blocks that are reachable from L0 without following a JSR or a RET are L0 and L1: they are marked as belonging to subroutine #1, and the JSR target L2 is enqueued. L2 is then removed from the queue, and the blocks that are reachable in the same way from L2, namely L2, L3 and L5, are marked as belonging to subroutine #2. In the process the JSR target L4 is enqueued. Finally L4 is removed from the queue, and the blocks that are reachable in the same way from L4, and not already marked, are marked as belonging to subroutine #3. Here this means only L4 is marked, because L5 was already marked as belonging to subroutine #2 (this happened because the nested subroutine #3 can "return" to its parent subroutine #2 without a RET):

The second step consists in finding the successors of the RET instructions. As explained above, this is done by examining each JSR instruction in turn. The first one, in L0, leads to the analysis of the basic blocks that belong to subroutine #2. In these basic blocks we find only one RET instruction, in L5. These JSR and RET instructions do not belong to the same subroutine, so we add L1 (the next basic block after the JSR in L0) as a successor of L5. The second JSR instruction, in L2, leads to the analysis of the basic blocks of subroutine #3. In these basic blocks we find one RET instruction, in L4. L2 and L4 do not belong to the same subroutine, so we add L3 (the next basic block after the JSR in L2) as a successor of L4.

The final control flow graph is the following:

Note: you may have noticed that the L4 "basic block" is not a real basic block, because it contains several instructions that can lead to other blocks. In fact it is the union of two consecutive basic blocks. This is not an error: with the COMPUTE_MAXS option it is not always necessary to decompose the control flow graph down to individual basic blocks. Therefore, when possible, several consecutive basic blocks are represented as a single block, for optimization purposes.

4 Code optimizations

The main objective of ASM is to produce code that is as small and as fast as possible, while keeping a reasonably "clean" public API. This section explains the techniques used in this project to achieve these objectives (many of them are not recommended for mainstream development).

4.1 Optimizing performance

The first and most important step to get good performance is to use the right API and good algorithms [R0]. Sometimes it is hard to decide which API or algorithm is best without actually implementing and testing all the options. In this case, take the time to implement and test all options to find the best one. There are however some general rules that can be used when designing an API and when implementing it, in order to get good performance (a side effect of these rules is to reduce code size):

Once the best API and algorithms have been found, several "low level" techniques can be used to optimize performances:

These techniques must be used with care (because they can make code harder to understand and to maintain), and only when they bring a real performance benefit. In order to see whether this is the case, the performance of every proposed optimization must be compared to the performance of the current code (if possible on several JVMs, in order to exclude singularities due to specific JVM optimization strategies).

Of course optimizations must be applied first to the code that is responsible for the largest part of the total execution time, in order to get the best performance improvements. In order to detect these "hotspots", a code profiler can be used. It is also possible to turn some code sections on or off, in order to measure, by comparison with the normal case, the cost of these sections.

4.2 Optimizing code size

The obvious way to reduce code size is to reduce the number of classes and methods! This begins by reducing the number of abstract classes in the public API (not because they are big, but because it indirectly reduces the number of classes and methods) [R10]. This reduction generally requires compromises with what a "clean" object oriented API would be. Whether these compromises are worthwhile or not must be decided on a case by case basis. Several such compromises can be seen in the current public ASM API. For example the AnnotationVisitor abstract class is used both for annotations and annotation arrays although, in the second case, the name argument is useless and would not exist if a separate AnnotationArrayVisitor abstract class were defined. Another example is the SignatureVisitor abstract class: a single class is used to represent a quite complex recursive data structure that, in a "normal" object oriented design, would be represented with up to 8 abstract classes.
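In practice the AnnotationVisitor compromise looks like this: array values are visited through the same AnnotationVisitor abstraction as named annotation values, and for array elements the name argument is simply ignored (null by convention). A short illustration using the real ASM API:

import org.objectweb.asm.AnnotationVisitor;

class AnnotationExample {
    static void writeAnnotation(AnnotationVisitor annotationVisitor) {
        annotationVisitor.visit("count", 42);  // a named annotation value
        AnnotationVisitor arrayVisitor = annotationVisitor.visitArray("names");
        arrayVisitor.visit(null, "a");  // array element: the name is ignored
        arrayVisitor.visit(null, "b");
        arrayVisitor.visitEnd();
        annotationVisitor.visitEnd();
    }
}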

The number of classes can be reduced in several ways. The most "elegant" one is to use appropriate design patterns [R11]. For instance the ClassReader class needs Attribute factories in order to create attribute objects for the class, method and field attributes. But instead of using the Factory design pattern, which requires two classes, ASM uses the Prototype pattern, which uses only one class.
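Concretely, a custom attribute class is both the attribute implementation and its own "factory": its read method creates a new instance of its own class. A simplified example (the Comment attribute and its content layout are hypothetical; see org.objectweb.asm.Attribute for the real read signature):

import org.objectweb.asm.Attribute;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.Label;

class CommentAttribute extends Attribute {
    String comment;

    CommentAttribute() {
        super("Comment");
    }

    @Override
    protected Attribute read(ClassReader classReader, int offset, int length,
            char[] charBuffer, int codeAttributeOffset, Label[] labels) {
        // The prototype creates and returns a new instance of its own
        // class: no separate factory class is needed.
        CommentAttribute result = new CommentAttribute();
        // Assumed layout: the attribute content is a constant pool index
        // of a CONSTANT_Utf8 entry.
        result.comment = classReader.readUTF8(offset, charBuffer);
        return result;
    }
}

Prototype instances are passed to ClassReader.accept(classVisitor, attributePrototypes, flags); each attribute found in the class file is then read by the prototype whose type name matches.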

The number of classes can also be reduced by merging several related classes into a single one [R12]. For instance a single Symbol class is used to represent all the constant pool item types, and the Label class is used to represent both bytecode offsets and basic blocks. The drawback of this approach is that it increases memory requirements (an integer Symbol object, for example, uses 7 fields where an IntegerSymbol class would require only 2) and object instantiation times (the Label and Frame classes were initially merged, but have been separated in order to improve performance).

The number of methods can be reduced by inlining methods that are called at only one place in the code (a method that is called from several places can also be inlined if the client code can be refactored to call it at only one place) [R13]. Another benefit of this technique is that it improves performance.

4.3 An example

This section illustrates the previous optimization techniques on a real example, namely the asm.signature package. The goal of this package is to provide an event based API, a parser and a writer for the generics signature grammar, given below:

ClassSignature:
  FormalTypeParameters? ClassTypeSignature ClassTypeSignature*

FormalTypeParameters:
  < FormalTypeParameter+ >

FormalTypeParameter:
  Identifier FieldTypeSignature? FieldTypeSignature*

FieldTypeSignature:
  ClassTypeSignature | ArrayTypeSignature | TypeVariableSignature

ClassTypeSignature:
  L Identifier ( / Identifier )* TypeArguments? ( . Identifier TypeArguments? )* ;

TypeArguments:
  < TypeArgument+ >

TypeArgument:
  * | ( ( + | - )? FieldTypeSignature )

ArrayTypeSignature:
  [ TypeSignature

TypeVariableSignature:
  T Identifier ;

TypeSignature:
  Z | C | B | S | I | F | J | D | FieldTypeSignature

MethodTypeSignature:
  FormalTypeParameters? ( TypeSignature* ) ( TypeSignature | V ) ( ^ClassTypeSignature | ^TypeVariableSignature )*

The first design of this package used 7 abstract classes, roughly corresponding to the grammar non terminals.

The parser was generated with JavaCC and the writer used 7 classes, one per abstract class.

The second step was to replace the generated parser with a hand written, recursive descent parser, containing 11 recursive methods (one per rule of the grammar). This led to major performance improvements and to a huge code size reduction, both because hand written code can be more efficient than generated code, and because all lexical and syntactic verifications were removed ([R2]).

In the third step a refactoring reduced the number of writer classes from 7 to 6 (by using inheritance).

In the fourth step the number of abstract classes was reduced to 4 ([R10]), and the 6 writer classes were merged into a single one ([R12]), using a boolean stack encoded as an int ([R8]). The parser was mostly unchanged. The reduced API allowed invalid signatures to be generated (it corresponded to a generalized grammar where FieldType, ClassType, ArrayType, TypeVariable and Type are merged into a single Type non terminal), but this was seen as an acceptable compromise.
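The boolean stack encoded as an int works as sketched below (a minimal illustration; in the real SignatureWriter this idea is used to remember state for each nesting level of the signature being written):

final class BooleanStackSketch {
    private int stack;  // bit 0 = top of the stack; at most 32 elements

    void push(boolean value) {
        stack = (stack << 1) | (value ? 1 : 0);
    }

    boolean top() {
        return (stack & 1) == 1;
    }

    boolean pop() {
        boolean result = (stack & 1) == 1;
        stack >>>= 1;
        return result;
    }
}

A push or a pop is a couple of arithmetic instructions and requires no allocation at all, unlike a stack of java.lang.Boolean objects.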

In the fifth step the parser was optimized by replacing some fields that represented the last parsed character and its index with method arguments ([R6]). In addition some methods were inlined, which removed 6 of the 11 parsing methods ([R13]).

Finally, after some feedback from users, the API was reduced to only one abstract class (although it allowed even more invalid signatures to be generated, this was judged more practical to define SignatureAdapters), and this provided more opportunities to inline parser methods: the final parser contains only 3 parsing methods.

The code size decreased from 22KB at the first step, to 6.8KB at the second step, 4.5KB at the fourth step, and less than 4KB at the last step. At the same time, the average time to parse and rebuild a signature decreased from 74 microseconds at the first step, to 15 microseconds at the second step, 5.4 microseconds at the fourth step, and less than 5 microseconds at the last step.

As can be seen from this example (more details can be found in the ASM mailing list archives of January 2005), improving performance often leads to code size reductions, and vice versa!