ASM 6 Developer Guide
by Eric Bruneton (last update 04/05/2018)
- Introduction
- Installation
- Code review
- Code optimizations
1 Introduction
This guide is primarily intended for ASM users who would like to contribute to the ASM code base, although it can be of interest for other users as well. It explains the organization of the code, the main data structures and the most complex algorithms. It also explains the strategies that have been used to optimize ASM's performance and to minimize its code size, and illustrates them through a concrete example.
2 Installation
This section explains how to compile the ASM source code, how to test it, and how to build a distribution, which is the first thing to know before contributing to this project. It assumes you have already downloaded the ASM source code, which can be done with git clone https://gitlab.ow2.org/asm/asm.git, or by downloading a snapshot of the git repository from https://gitlab.ow2.org/asm/asm.
2.1 Overview
The ASM sources are organized in several directories, with one directory per distribution artifact (asm.jar, asm-util.jar, etc), plus a few other directories for building, testing and benchmarking them:
- asm, asm-analysis, asm-commons, asm-tree, asm-util and asm-xml contain the sources of the corresponding artifacts, together with their unit tests. They use the standard Maven directory layout, which is also the default layout used by Gradle.
- asm-test contains base classes which are used in the previous projects to implement the unit tests, as well as precompiled Java classes which are used as test cases in these projects.
- benchmarks defines some JMH benchmarks to measure the performance of ASM, and to compare it to previous versions (to make sure there are no performance regressions) and to other bytecode manipulation libraries.
- gradle contains the Gradle wrapper and the Linux and Windows scripts to invoke it.
- tools contains two Gradle projects which are used to build the ASM artifacts: a retrofitter tool used to retrofit classes compiled with Java 6 into Java 5 class files, and a BND plugin to generate a module-info.class for each ASM artifact. Both tools use ASM itself.
The build.gradle file defines the targets that are used to build, test and package the ASM source code. These targets are presented in the following sections.
2.2 Building
To build everything, use gradle/gradlew build from the main directory. To build a single project, for instance asm-tree, use gradle/gradlew :asm-tree:build. This compiles the code, runs the unit tests, and performs other checks to make sure that the code is properly formatted, that it is binary backward compatible with previous ASM versions, and that it is fully covered by unit tests.
Note: if the source code formatting verification fails, you can fix this by reformatting the code with gradle/gradlew googleJavaFormat.
2.3 Running tests
To test everything, use gradle/gradlew test from the main directory. To test a single project, for instance asm-util, use gradle/gradlew :asm-util:test.
Test coverage reports can be generated with gradle/gradlew jacocoTestReport. The result is available in the build/reports/jacoco directory of each project.
2.4 Creating a distribution
A binary distribution, containing the compiled artifacts with their Javadoc and sources jars, a POM and digital signatures, can be generated with gradle/gradlew uploadArchives. The result is generated in the /tmp/myRepo directory.
Note: the version number used in the files produced when creating a distribution is defined in the header part of the build.gradle file.
3 Code review
This section presents the ASM source code organization, the main data structures and the most complex algorithms, which is the next important thing to know, after the installation and building procedures, in order to contribute to this project.
3.1 Code organization
ASM is organized in several packages:
- org.objectweb.asm is the core package. It defines the ASM visitor API and provides the ClassReader and ClassWriter classes to read and write compiled Java classes. This package does not depend on the other ones and can be used alone.
- org.objectweb.asm.signature provides an API to read and write generics signatures. It is independent of the core package but complements it.
- org.objectweb.asm.tree provides a DOM-like API on top of the SAX-like API provided by the core package. It can be used to implement complex class transformations, for which the core package would be too complicated to use.
- org.objectweb.asm.tree.analysis provides a static bytecode analysis framework on top of the tree package. It can be used in addition to the tree package to implement really complex class transformations that need to know the state of the stack map frames for each instruction.
- org.objectweb.asm.commons provides some useful class adapters that are based on the core and tree packages. These adapters can be used as is or can be extended to implement more specific class transformations.
- org.objectweb.asm.util provides some useful class visitors and adapters that can be used for debugging purposes. It is generally not needed at runtime.
- org.objectweb.asm.xml is deprecated. It provided the ability to convert classes to and from XML.
From an implementation point of view the core package is the most complex one. The tree, util and xml packages are very simple (they just convert classes from one high level representation to another one, which is much simpler than converting classes between their byte array form and a high level representation). The signature package is also quite simple (it consists of a parser and a pretty printer for a small grammar). In fact the only packages that are not completely trivial, besides the core package, are the commons and analysis packages. But the algorithms used in the analysis package are similar to the ones explained in section 3.5. This explains why the rest of this guide focuses on the core package only.
3.2 Main data structures
3.2.1 Object model
The core package is made of 28 classes and interfaces. If we exclude the Opcodes interface, the 5 abstract visitor classes (AnnotationVisitor, ClassVisitor, FieldVisitor, MethodVisitor and ModuleVisitor) and the 6 utility classes (ConstantDynamic, Constants, Handle, Type, TypePath and TypeReference), this leaves only 16 classes, which are depicted in the figure below.
The conversion of compiled classes to visit events is done by a single class, namely the ClassReader class, which uses the auxiliary class Context. The inverse process is done by the other 14 classes, which are organized around the ClassWriter class:
- Classes: the ClassWriter class is the main entry point. It contains fields that describe the class version, access flags, name, etc. It also contains references to other objects that represent the constant pool, the fields, the methods, the annotations and the attributes of the class.
- Constant pool:
  - the SymbolTable class is used to represent the constant pool and the bootstrap methods, as well as an ASM specific type table used to compute stack map frames. These structures are represented both in byte array form and as a hash set of Symbol instances, in order to efficiently test whether a given constant pool item, bootstrap method or type has already been added to the symbol table. This class is referenced from the ClassWriter, AnnotationWriter, FieldWriter, MethodWriter and ModuleWriter classes, since classes, annotations, fields, methods and modules need to add the constants they reference to the constant pool and/or the bootstrap methods.
  - the Symbol class is used to represent a single constant pool item, a single bootstrap method or a single ASM specific type used to compute stack map frames.
- Fields: the FieldWriter class is used to write fields. It contains fields that describe the field's name, type, signature, value, etc. It also contains references to other objects that represent the field's annotations and attributes.
- Methods:
  - the MethodWriter class is used to write methods. It contains fields that describe the method's name, signature, exceptions, etc. It also contains references to other objects that represent the method's annotations and attributes. The method's code is stored in a byte array that is constructed during the visit of bytecode instructions. The labels used to reference instructions are stored in a linked list of Label objects.
  - the Label class is used to reference instructions, but also to represent basic blocks, which are used for the automatic computation of the maximum stack size and of the stack map frame table of a method.
  - the Handler class is used to represent try catch blocks. Each handler references three Label objects that define the start and end of the try block, and the start of the catch block.
  - the Frame and CurrentFrame classes are used for the automatic computation of the stack map frames of a method.
  - the Edge class is used to represent the control flow graph of a method, which is a graph of basic blocks, i.e. of Label objects. An Edge is an edge between two Label objects in this graph.
- Modules: the ModuleWriter class is used to write the Module, ModulePackages and ModuleMainClass class attributes, which are related to modules (a Java module definition is compiled into a class file containing these attributes).
- Annotations: the AnnotationWriter class is used to write annotations. This class is referenced from the ClassWriter, FieldWriter and MethodWriter classes, since classes, fields and methods can have annotations.
- Attributes: the Attribute class is used to read and write non standard class attributes. It must be subclassed for each specific non standard attribute that must be read and written. This class is referenced from the ClassWriter, FieldWriter and MethodWriter classes, since classes, fields and methods can have attributes.
- Resources: the ByteVector class is used to serialize the class elements while they are visited. It is used to represent the constant pool, the annotation values, the method's code, the stack map tables, the line number tables, etc.
3.2.2 Collections
The core package does not use any java.util class. Instead, lists, sets and graphs are encoded in dedicated fields of their elements (a minimal sketch of this pattern is given after the list):
- Lists are represented as linked lists whose links are stored directly in the list elements themselves. For instance a list of FieldWriter objects is represented as the FieldWriter objects themselves, linked through their fv field. Likewise for MethodWriter, AnnotationWriter, Label, etc. The advantage of this method, compared to using separate objects to store the linked list itself (as in java.util.LinkedList), is that it saves memory. The drawback is that a given element cannot belong to several lists at the same time, but this is not a problem in the ASM case.
- The only hash set used in the core package, in the SymbolTable class, is implemented with an array of SymbolTable.Entry instances (a subclass of Symbol) that can be chained together through their next field (to handle the case of hash collisions). In other words, as for lists, the hash set structure is embedded in the hash set elements themselves. The advantages and drawbacks are the same (this saves memory but elements cannot belong to several hash sets at once).
- Similarly, the control flow graph (see section 3.5) data structure is embedded in the graph nodes themselves, i.e. in the Label objects. Since Label objects must be stored in several data structures at the same time, they have several distinct fields that encode these data structures:
  - the nextBasicBlock field is used to encode the list of labels of a method, in the order they are visited.
  - the outgoingEdges field is used to store the list of Edge objects (linked through their nextEdge field) that represent the control flow graph data structure.
  - the nextListElement field is used to store temporary lists of labels in the algorithms used to compute the maximum stack size and stack map frames of methods (see section 3.5.1).
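To make this pattern concrete, here is a minimal, self-contained sketch (with hypothetical names, not ASM's actual fields) of such an intrusive linked list:

import java.util.Objects;

public class IntrusiveListDemo {

  static class FieldElement {
    final String name;
    FieldElement next; // plays the role of FieldWriter's link field

    FieldElement(String name) {
      this.name = Objects.requireNonNull(name);
    }
  }

  public static void main(String[] args) {
    // Build the list by keeping references to the first and last elements,
    // as the writer classes do; no separate node objects are allocated.
    FieldElement first = null, last = null;
    for (String name : new String[] {"a", "b", "c"}) {
      FieldElement element = new FieldElement(name);
      if (first == null) {
        first = element;
      } else {
        last.next = element;
      }
      last = element;
    }
    // Traverse the list through the elements' own links.
    for (FieldElement e = first; e != null; e = e.next) {
      System.out.println(e.name);
    }
  }
}

The writer classes follow the same shape: each element stores the link to its successor itself, which is why an element cannot belong to two lists at once.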
3.3 Main algorithms
3.3.1 ClassReader
The ClassReader algorithm is quite straightforward. It is summarized below.
- parse the constant pool and the bootstrap methods (in the constructor):
  - store the start offset of each constant pool item in cpInfoOffsets
  - store the start offset of each bootstrap method in bootstrapMethodOffsets
  - store the size of the longest string constant in maxStringLength
- parse the class (in the accept and read* methods):
  - parse the header
  - parse the class attributes. Depending on the complexity of the attribute:
    - either parse it and store its value in a local variable (for attributes containing a simple value),
    - or store its start offset in a local variable (for attributes with a complex structure)
  - call the visit methods corresponding to the detected attributes. For complex attributes, which were not parsed in the previous step (such as annotations), parse them and visit them at the same time: i.e. parse one part, visit it, parse the next part, visit it, etc.
  - for each field (in the readField method):
    - parse the header
    - parse the field attributes. Depending on the attribute:
      - either parse it and store its value in a local variable,
      - or store its start offset in a local variable
    - call visitField
    - call the visit methods corresponding to the detected attributes. For complex attributes, which were not parsed in the previous step (such as annotations), parse them and visit them at the same time: i.e. parse one part, visit it, parse the next part, visit it, etc.
    - call visitEnd
  - for each method (in the readMethod method):
    - parse the header
    - parse the method attributes. Depending on the attribute:
      - either parse it and store its value in a local variable,
      - or store its start offset in a local variable
    - call visitMethod
    - if the returned visitor is a MethodWriter, and if its ClassWriter's constant pool was copied from this reader (see section 3.3.2), the method bytes can be copied as is: in this case skip all the steps below
    - call the visit methods corresponding to the detected attributes. For complex attributes, which were not parsed in the previous step (such as annotations), parse them and visit them at the same time: i.e. parse one part, visit it, parse the next part, visit it, etc.
    - for the special case of the Code attribute (in the readCode method):
      - find the labels and store them in the Context.currentMethodLabels array:
        - look for labels in the code
        - look for labels in the exception handlers
        - look for labels in the line number and local variable tables
        - look for labels in the other code attributes
      - if there is a stack map table, parse the first frame
      - parse the instructions; for each instruction:
        - if there is a stack map frame for this offset, call visitFrame, and parse the next frame
        - if there is a label for this offset, call visitLabel
        - if there is a line number entry for this offset, call visitLineNumber
        - call visitXxxInsn
      - call visitAttribute for each non standard code attribute parsed during the second step
      - call visitMaxs
    - call visitEnd
  - call visitEnd
Some points are interesting to note:
- the visit of line numbers and stack map frames is interleaved with the visit of instructions. In the case of stack map frames, not only the visit but also the parsing of the stack map table and of the method's code is interleaved. The advantage, compared to parsing the stack map table and then the method's code, is that no complex data structure is needed to store the parsed frames for the second step.
- constant pool items are parsed every time they are referenced, except for Utf8 and ConstantDynamic items, whose values are cached in the cpInfoValues array. Note also that a single char array is reused to parse these items. It must be large enough to parse the longest string, hence the computation of maxStringLength in the constructor. A sketch of this caching strategy is given below.
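Here is a hedged sketch of this caching strategy; the names mirror those used above, but the parsing is deliberately simplified (single-byte characters only), so this is not ASM's actual code:

class Utf8CacheDemo {
  private final byte[] classFile;      // raw class file bytes
  private final int[] cpInfoOffsets;   // start offset of each constant pool item
  private final Object[] cpInfoValues; // parsed values, filled lazily
  private final char[] charBuffer;     // shared buffer, sized from maxStringLength

  Utf8CacheDemo(byte[] classFile, int[] cpInfoOffsets, int maxStringLength) {
    this.classFile = classFile;
    this.cpInfoOffsets = cpInfoOffsets;
    this.cpInfoValues = new Object[cpInfoOffsets.length];
    this.charBuffer = new char[maxStringLength];
  }

  // Returns the string value of constant pool item 'index', parsing it only once.
  String readUtf8(int index) {
    String value = (String) cpInfoValues[index];
    if (value == null) {
      int offset = cpInfoOffsets[index];
      int length = ((classFile[offset] & 0xFF) << 8) | (classFile[offset + 1] & 0xFF);
      int strLength = 0;
      // Simplified: decode single-byte chars only; the real parser handles
      // the full (modified) UTF8 encoding.
      for (int i = 0; i < length; ++i) {
        charBuffer[strLength++] = (char) (classFile[offset + 2 + i] & 0x7F);
      }
      value = new String(charBuffer, 0, strLength);
      cpInfoValues[index] = value;
    }
    return value;
  }
}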
3.3.2 ClassWriter, AnnotationWriter, FieldWriter, MethodWriter and ModuleWriter
Since the visit of the class members can be interleaved (it is possible to start visiting a field, then start visiting a method, go back to visit annotations of the field, continue with some instructions of the method, visit attributes of the field, add new instructions to the method, and so on), it is not possible to construct the class file's byte array in a sequential order, from beginning to end. Instead it is necessary to use several byte vectors that can grow simultaneously. This is why there are several writer classes, unlike for the reader case.
The ClassWriter class is the main entry point. It contains the class header elements and the lists of its fields and methods, as well as a SymbolTable instance, containing the constant pool items and the bootstrap methods of the class. This symbol table uses a hash set of Symbol objects, in order to avoid adding the same item several times to the constant pool or to the bootstrap methods array. The symbol table can be created from an existing class by passing a ClassReader argument to its constructor. This allows unchanged methods to be copied as is from a class reader to a class writer, without visiting their content (see section 3.3.1).
The AnnotationWriter, FieldWriter and ModuleWriter classes are quite simple: they convert visit events to a byte array representation, by using the class writer's SymbolTable in order to add constant pool items when necessary. The most complex writer class is the MethodWriter class, because it manages advanced features such as the automatic computation of the maximum stack size and number of local variables of a method, the automatic computation of its stack map frames, as well as the automatic management of short versus long jump instructions. Without these features, each visitXxxInsn method would be very simple, i.e. it would just add the bytecode representation of an instruction to the code byte vector.
Instead, in order to be able to automatically compute the maximum stack size, the maximum number of local variables and the stack map frames, each visitXxxInsn method does the following:
- append the instruction to the code byte vector
- if (currentBasicBlock != null) // if some automatic computation is needed
  - if (compute == COMPUTE_ALL_FRAMES || compute == COMPUTE_INSERTED_FRAMES)
    - simulate the execution of this instruction on the stack map frame
  - else // compute == COMPUTE_MAX_STACK_AND_LOCAL
    - simulate the execution of this instruction on the stack height
    - keep track of the local variables used, if any
  - keep track of the successors of this instruction
  - update currentBasicBlock
The maximum stack size or the stack map frames are actually computed in the computeMaxStackAndLocal and computeAllFrames methods, by using the control flow graph constructed in the visitXxxInsn methods (see section 3.5 for more details).
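To make this structure concrete, here is a simplified, hypothetical sketch of a visitXxxInsn method, restricted to the COMPUTE_MAX_STACK_AND_LOCAL case; the constants and fields (STACK_SIZE_DELTA, relativeStackSize, ...) are illustrative, not ASM's actual internals:

import java.io.ByteArrayOutputStream;

class MethodWriterSketch {
  static final int RETURN = 0xB1, ATHROW = 0xBF;
  // Stack size delta of each opcode (left at zero here; the real table is filled in).
  static final int[] STACK_SIZE_DELTA = new int[256];

  final ByteArrayOutputStream code = new ByteArrayOutputStream();
  Object currentBasicBlock = new Object(); // stand-in for a Label
  int relativeStackSize, maxRelativeStackSize;

  void visitInsn(int opcode) {
    // 1. Append the instruction to the code byte vector.
    code.write(opcode);
    // 2. Update the automatic computations, if enabled. In COMPUTE_ALL_FRAMES
    // mode the full stack map frame would be simulated here instead.
    if (currentBasicBlock != null) {
      relativeStackSize += STACK_SIZE_DELTA[opcode];
      maxRelativeStackSize = Math.max(maxRelativeStackSize, relativeStackSize);
      // 3. Instructions without successors end the current basic block.
      if (opcode == RETURN || opcode == ATHROW) {
        currentBasicBlock = null;
      }
    }
  }
}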
3.3.3 Label
From a user point of view the Label class is used to reference instructions. Internally it is used to store the bytecode offset of an instruction (i.e. the bytecode array index of the first byte of the instruction), and to compute relative bytecode offsets (i.e. the difference between the bytecode offsets of two instructions). It is also used to represent basic blocks, which are used for the automatic computation of the maximum stack size and of the stack map frames of a method (see section 3.5).
Jump instructions such as IFEQ or GOTO are stored in bytecode as an opcode followed by a relative bytecode offset to the target instruction (this relative offset is the difference between the bytecode offset of the target instruction and the bytecode offset of the jump instruction). This relative offset can be computed easily in the case of a backward jump (i.e. a jump to an instruction that is before the jump instruction in the bytecode), but it cannot be computed at all in the case of a forward jump (i.e. a jump to an instruction that is after the jump instruction) since, in this case, the target instruction has not been visited yet, and so its bytecode offset is unknown. The case of forward jumps is solved in the following way:
- The jump instruction is written with a (temporary) relative offset equal to 0.
- The target Label object is updated to memorize the fact that this jump instruction makes a forward reference to this label (the forwardReferences array in Label is used for that).
- When this label is visited, i.e. when its bytecode offset becomes known, all the forward jump instructions to this label (given by forwardReferences) are updated, to replace the temporary relative offsets with their real values, which can now be computed.
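The following is a minimal sketch of this mechanism, under simplifying assumptions (only 2-byte offsets, and a plain list of references; the real Label class packs this information more compactly):

import java.util.ArrayList;
import java.util.List;

class LabelSketch {
  int bytecodeOffset = -1; // unknown until the label is visited
  // Pairs (jump instruction offset, offset of the 2 bytes to patch in 'code').
  final List<int[]> forwardReferences = new ArrayList<>();

  // Called when a jump to this label is written at 'insnOffset'.
  void addReference(int insnOffset, int patchOffset, byte[] code) {
    if (bytecodeOffset >= 0) {
      // Backward jump: the relative offset is already known.
      writeShort(code, patchOffset, bytecodeOffset - insnOffset);
    } else {
      // Forward jump: write a temporary 0 and remember where to patch.
      writeShort(code, patchOffset, 0);
      forwardReferences.add(new int[] {insnOffset, patchOffset});
    }
  }

  // Called when the label is visited, i.e. when its offset becomes known.
  void resolve(int offset, byte[] code) {
    bytecodeOffset = offset;
    for (int[] reference : forwardReferences) {
      writeShort(code, reference[1], offset - reference[0]);
    }
  }

  private static void writeShort(byte[] code, int offset, int value) {
    code[offset] = (byte) (value >>> 8);
    code[offset + 1] = (byte) value;
  }
}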
3.3.4 toByteArray
The toByteArray method in ClassWriter puts together all the pieces constructed in the various writer classes in order to get the full byte representation of the class. This is done in two steps:
- the size of the class is computed by summing the sizes of all the pieces, which is given by the getSize method (this method can add items to the constant pool, which modifies its size; this is why the constant pool size is added only at the very end).
- a byte vector of this size is allocated, and the pieces are copied into this vector in the right order. This is done by calling the putXxx method on each piece (e.g. putFieldInfo on each FieldWriter).
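A hedged sketch of this two-step assembly, using hypothetical ClassFilePiece and getSize/put names rather than ASM's actual types:

import java.nio.ByteBuffer;
import java.util.List;

interface ClassFilePiece {
  int getSize();               // may add constant pool items as a side effect
  void put(ByteBuffer output); // copies this piece into the output vector
}

class ToByteArraySketch {
  static byte[] toByteArray(List<ClassFilePiece> pieces, ClassFilePiece constantPool) {
    // Step 1: compute the total size. The constant pool size is added last,
    // because the other getSize() calls may still add items to it.
    int size = 0;
    for (ClassFilePiece piece : pieces) {
      size += piece.getSize();
    }
    size += constantPool.getSize();
    // Step 2: allocate the vector once and copy the pieces in the right order,
    // which avoids any intermediate array enlargement and copy.
    ByteBuffer output = ByteBuffer.allocate(size);
    constantPool.put(output);
    for (ClassFilePiece piece : pieces) {
      piece.put(output);
    }
    return output.array();
  }
}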
3.4 Jump instructions encoding algorithm
The jump instructions such as IFEQ and GOTO store relative bytecode offsets as signed values encoded in two bytes. This offset can therefore vary between -32768 and 32767. However the bytecode of a method can be as large as 65535 bytes: it is therefore possible to have relative bytecode offsets that cannot be represented as signed two-byte values. Fortunately there are two special jump instructions that store their relative bytecode offset as a signed four-byte value, namely GOTO_W and JSR_W.
In the case of backward jumps the jump offset is known when the jump instruction is visited. It is then easy to use GOTO or GOTO_W, depending on the value of this relative offset. But in the case of forward jumps this is not possible. A first solution is to assume that this offset will require 4 bytes, and to always use GOTO_W for forward jumps. But this is not optimal. A second solution is to assume that this offset will require 2 bytes, and to use a normal jump instruction, with 2 bytes reserved to store the offset when it becomes known. The problem is that if the offset turns out to require 4 bytes, the jump instruction must be replaced with another one using a 4-byte offset, which can invalidate already computed offsets used in other jump instructions. ASM uses the second solution. It requires a method to replace the forward jump instructions that turn out to require long offsets. The algorithm used for this is presented in the rest of this section.
3.4.1 Basic algorithm
The algorithm uses the following properties:
- the maximum method size, and therefore the maximum forward jump offset, is 65535, which can be stored in an unsigned two-byte value,
- the opcodes of the JVM bytecode instruction set do not use all the 256 possible values and, in fact, there are enough unused opcodes to define an unsigned equivalent of each standard jump instruction opcode (which we note ASM_GOTO, ASM_JSR, ASM_IFEQ, and so on).
Given these properties, if it turns out, in visitLabel, that a forward offset is greater than 32767, the opcode of the corresponding instruction is changed to the equivalent non standard opcode with an unsigned offset, and the offset is stored as an unsigned short (a sketch of this check is given below).
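A small sketch of this check, with an illustrative (not actual) opcode value for ASM_GOTO:

class ForwardJumpWideningSketch {
  static final int ASM_GOTO = 0xE0; // hypothetical unused opcode value

  // Patches a forward GOTO at 'insnOffset' once the target offset is known.
  static void patchForwardJump(byte[] code, int insnOffset, int targetOffset) {
    int relativeOffset = targetOffset - insnOffset;
    if (relativeOffset > Short.MAX_VALUE) {
      // Too large for a signed short: switch to the non standard opcode and
      // store the offset as an *unsigned* short (it fits, since methods are
      // at most 65535 bytes long).
      code[insnOffset] = (byte) ASM_GOTO;
    }
    code[insnOffset + 1] = (byte) (relativeOffset >>> 8);
    code[insnOffset + 2] = (byte) relativeOffset;
  }
}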
Then, when the whole class has been visited and converted to a byte array, if some non standard opcodes have been used, they must be replaced with standard ones. For this:
- the content of the ClassWriter is cleared (except its symbol table), and the byte array is parsed with a ClassReader chained to this ClassWriter, to rebuild the class. The class is parsed with the special EXPAND_ASM_INSNS class reader flag, whose effect is explained below.
- the class reader converts the ASM specific instructions to standard ones, with explicit GOTO_W and JSR_W instructions and, because of the EXPAND_ASM_INSNS flag, leaves the existing GOTO_W and JSR_W instructions unchanged (normally they are converted to GOTO and JSR, but not here, since we know 4-byte offsets will be needed - this is also needed to avoid infinite loops where the ClassWriter is cleared and reconstructed indefinitely).
This process removes the existing ASM specific instructions, but can introduce new ones. Indeed, because the ASM specific instructions are replaced with longer sequences of standard instructions, some relative offsets between existing instructions, which were just below the 32767 limit, can become larger than this limit. In this case the process is repeated: the byte array containing the new ASM specific instructions is parsed again with a new ClassReader (after the ClassWriter content has been cleared) chained to the ClassWriter. Eventually this process converges and the result is a class with only standard instruction opcodes.
3.4.2 Impact on stack map frames
An IFXxx instruction with a forward offset larger than 32767 must be replaced with an IFNotX GOTO_W sequence, which requires a stack map frame after the GOTO_W (for classes targeting Java 7 or later). This stack map frame may not already be present, in which case it needs to be computed and inserted. For this, one solution would be to recompute all the stack map frames, but this is not always possible (computing the stack map frames may require access to the class hierarchy, may load classes, etc) and would not be efficient. Instead, we use the existing stack map frames to compute the one we need to insert. Indeed, by definition, the existing stack map frames are sufficient to compute the state of the local variables and of the stack at each instruction, with a linear algorithm parsing the instructions and the frames from the beginning to the end.
This is implemented as follows:
- when the ClassWriter is cleared to replace the ASM specific instructions, it is put in a COMPUTE_INSERTED_FRAMES mode.
- in this mode, the MethodWriter uses a CurrentFrame instead of a Frame to simulate the execution of each instruction, which has the effect of computing explicitly the state of the local variables and of the stack at each instruction (normally, in COMPUTE_FRAMES mode, this explicit state is only computed at the basic block level, not at the instruction level). The existing frames are also used as is, to replace the CurrentFrame with the visited frame. Finally, if a frame with the special type F_INSERT is visited, then the current frame state is used to insert a frame at the current bytecode offset.
- in the ClassReader used to replace the ASM specific instructions, when an IFXxx instruction is replaced with an IFNotX GOTO_W sequence, a visitFrame(F_INSERT, ...) is emitted, with an empty content, in order to insert a frame at this location.
3.4.3 An example
Consider the following Java code:
public void m(int i, int j) {
  for (; cond(i); --i) {
    if (j == 0) {
      break;
    }
    ...
  }
}
It is compiled into:
public m(II)V
  GOTO L1
L2
  ILOAD 2
  IFNE L3
  GOTO L4
L3
  ...
  IINC 1 -1
L1
  ALOAD 0
  ILOAD 1
  INVOKEVIRTUAL C.cond(I)Z
  IFNE L2
L4
  RETURN
During the visit of each instruction, offsets are computed step by step as follows:
Offset | Instruction | Comment
---|---|---
0 | GOTO L1 | relative offset unknown when this instruction is visited
3 | L2 |
3 | ILOAD 2 |
4 | IFNE L3 | relative offset unknown when this instruction is visited
7 | GOTO L4 | relative offset unknown when this instruction is visited
10 | L3 | relative offset for IFNE L3 becomes known: 10 - 4 = 6
10 | ... |
32764 | IINC 1 -1 |
32767 | L1 | relative offset for GOTO L1 becomes known: 32767 - 0 = 32767
32767 | ALOAD 0 |
32768 | ILOAD 1 |
32769 | INVOKEVIRTUAL |
32772 | IFNE L2 | relative offset = 3 - 32772 = -32769
| L4 |
| RETURN |
The offset computed for IFNE L2 is -32769, which cannot be stored in a short. The instruction is therefore transformed into an IFEQ GOTO_W L2 sequence on the fly:
Offset | Instruction | Comment
---|---|---
0 | GOTO L1 | relative offset = 32767, changed during visit of L1
3 | L2 |
3 | ILOAD 2 |
4 | IFNE L3 | relative offset = 6, changed during visit of L3
7 | GOTO L4 | relative offset still unknown
10 | L3 |
10 | ... |
32764 | IINC 1 -1 |
32767 | L1 |
32767 | ALOAD 0 |
32768 | ILOAD 1 |
32769 | INVOKEVIRTUAL |
32772 | IFEQ L4 | relative offset = 8
32775 | GOTO_W L2 | relative offset = 3 - 32775 = -32772
32780 | L4 | relative offset for GOTO L4 becomes known: 32780 - 7 = 32773
| RETURN |
When L4 is reached the offset computed for GOTO L4 becomes known: we get 32773, which cannot be stored as a signed short. As explained in section 3.4.1, the GOTO instruction is then replaced with a non standard ASM_GOTO instruction, whose offset is stored as an unsigned short:
Offset | Instruction | Comment
---|---|---
0 | GOTO L1 | relative offset = 32767, changed during visit of L1
3 | L2 |
3 | ILOAD 2 |
4 | IFNE L3 | relative offset = 6, changed during visit of L3
7 | ASM_GOTO L4 | relative offset = 32773, changed during visit of L4
10 | L3 |
10 | ... |
32764 | IINC 1 -1 |
32767 | L1 |
32767 | ALOAD 0 |
32768 | ILOAD 1 |
32769 | INVOKEVIRTUAL |
32772 | IFEQ L4 | relative offset = 8
32775 | GOTO_W L2 | relative offset = 3 - 32775 = -32772
32780 | L4 |
32780 | RETURN |
Since the bytecode contains at least one non standard instruction, the ClassWriter content is cleared, just after the byte array representation of the class has been computed (in toByteArray), and reconstructed by parsing the byte array with a ClassReader chained to this ClassWriter. During this process the ClassReader converts the ASM_GOTO into a GOTO_W (and preserves the existing GOTO_W). When L1 is reached, we get:
Offset | Instruction | Comment
---|---|---
0 | GOTO L1 | relative offset unknown
3 | L2 |
3 | ILOAD 2 |
4 | IFNE L3 | relative offset = 8, changed during visit of L3
7 | GOTO_W L4 | relative offset unknown
12 | L3 |
12 | ... |
32766 | IINC 1 -1 |
32769 | L1 |
| ALOAD 0 |
| ILOAD 1 |
| INVOKEVIRTUAL |
| IFEQ L4 | relative offset = 8
| GOTO_W L2 |
| L4 |
| RETURN |
At this point the relative offset for GOTO L1 becomes known, namely 32769. This is too large for a signed short offset, so GOTO L1 is replaced with ASM_GOTO L1 with an unsigned short offset. The class parsing and reconstruction then continues, and eventually a new byte array for the class is produced. Since it still contains a non standard opcode, the same process is repeated: the ClassWriter content is cleared and reconstructed by parsing the byte array with a ClassReader. Thanks to the special EXPAND_ASM_INSNS flag, GOTO_W and JSR_W are never converted back to GOTO or JSR, which ensures that this iterative process eventually converges. In our example, the second iteration does not produce any new non standard opcode, and we get the final result:
Offset | Instruction | Comment
---|---|---
0 | GOTO_W L1 | relative offset = 32771
5 | L2 |
5 | ILOAD 2 |
6 | IFNE L3 | relative offset = 8
9 | GOTO_W L4 | relative offset = 32775
14 | L3 |
14 | ... |
32768 | IINC 1 -1 |
32771 | L1 |
32771 | ALOAD 0 |
32772 | ILOAD 1 |
32773 | INVOKEVIRTUAL |
32776 | IFEQ L4 | relative offset = 8
32779 | GOTO_W L2 | relative offset = -32774
32784 | L4 |
32784 | RETURN |
3.5 Control and data flow analysis algorithm
This section presents the algorithm used to compute the maximum stack size and the stack map frames of a method. This algorithm is a control and data flow analysis algorithm, based on the decomposition of the method into a control flow graph of basic blocks. A basic block is a sequence of instructions where only the first instruction is the target of a jump instruction, and where only the last instruction can jump to other basic blocks. The control flow graph of a method is the graph whose nodes are the basic blocks, and whose edges connect the basic blocks that are linked by jump instructions. This graph is constructed during the visit of each instruction.
In the following we explain the algorithm used to compute the stack map frames, implemented in the computeAllFrames method. The algorithm to compute the maximum stack size, implemented in the computeMaxStackAndLocal method, is similar but simpler (since only the size of the stack is needed), so it is not detailed here (except for the handling of subroutines).
3.5.1 Basic data flow analysis algorithm
Stack map frames are computed in a two-step process:
- During the visit of each instruction, the state of the frame at the end of the current basic block is updated by simulating the action of the instruction on the previous state of this so called "output frame".
- In computeAllFrames, a fix point algorithm is used to compute the "input frame" of each basic block, i.e. the stack map frame at the beginning of the basic block, starting from the input frame of the first basic block (which is computed from the method descriptor), and by using the previously computed output frames to compute the input state of the other blocks.
Let's take a simple example in order to explain the details of this algorithm. Consider the following very simple method (where B is a subclass of A):
public static m(Z)LA;
  GETSTATIC B.VALUE : LB;
  ASTORE 1
  GOTO L0
L1
  GETSTATIC A.VALUE : LA;
  ASTORE 1
L0
  ILOAD 0
  IFNE L1
  ALOAD 1
  ARETURN
First step
As stated above, during the first step of the algorithm, which takes place in each visitXxxInsn method in MethodWriter, the state of the output frame of each basic block is updated by simulating the action of the visited instruction. It is important to note that the algorithm works at the basic block level, and not at the instruction level. This means that input and output frames are associated with basic blocks, and not with individual instructions. In practice, they are stored in a Frame object that is associated with a Label object, which marks the beginning of basic blocks.
The effect of this first step for the above example method is illustrated on the table below:
Label | Instruction | Output frame | Comment
---|---|---|---
| GETSTATIC B.VALUE : LB; | O1 = [??] [?B] | getstatic pushes a value of type B on the stack
| ASTORE 1 | O1 = [?B] [?] | astore consumes this value and stores it in local 1
| GOTO L0 | O1 = [?B] [?] | goto does not change the frame
L1 | GETSTATIC A.VALUE : LA; | O2 = [??] [?A] | each basic block starts with a new, unknown frame
| ASTORE 1 | O2 = [?A] [?] | astore stores the value produced by getstatic in local 1
L0 | ILOAD 0 | O3 = [??] [?I] | iload pushes the value of local 0, which is of type int
| IFNE L1 | O3 = [??] [?] | ifne consumes this value
| ALOAD 1 | O4 = [??] [?L1] | aload pushes the value of local 1, which is unknown
| ARETURN | O4 = [??] [?] | areturn consumes this value
At the beginning, the output frame O1 of the first basic block is completely unknown. During the visit of the first instruction, the action of GETSTATIC is simulated: the result is that a new value of type B is pushed on the stack, on top of the previous values (although we know here that the stack is initially empty, we do not take this into account and proceed as if the stack could previously contain any number of values of any type - hence the [?B]). During the visit of the second instruction, the output frame O1 is updated to simulate the effect of ASTORE: the result is that the value previously pushed on the stack is popped and stored in the local variable 1. The visit of the third instruction does not change the output frame O1, but changes the current basic block to null.
The visit of the L1 label makes L1 become the new current basic block. Like for the first basic block, the output frame O2 of this basic block is initially completely unknown. The visit of the instructions of this basic block is then similar to the visit of the previous instructions.
The visit of the L0 label makes L0 become the new current basic block. Here again we start with a completely unknown output frame O3 although, in this case, we could start from the value of O2 (since this basic block is a successor of the previous one). The ILOAD instruction loads the value of local variable 0, which is necessarily of type int (the whole algorithm is based on the assumption that the code is correct), and pushes it on the stack. The IFNE instruction consumes this value.
Another effect of simulating the action of the IFNE instruction is to create a new basic block, and to make it the new current basic block. This is why, although there is no label just after this instruction, the basic block changes. Here again, the output frame O4 of this basic block is initially completely unknown (although, as before and for the same reason, we could start from the value of O3). The ALOAD instruction loads the value of local variable 1, whose type is unknown since the frame is initially completely unknown. The only thing we know is that, after the execution of this method, the stack contains one additional value whose type is the type of local variable 1 before this instruction was executed (hence the [?L1]).
Second step
During the second step of the algorithm, which takes place in the computeAllFrames method, the input frames of each basic block are computed by using an iterative fix point algorithm, i.e. an algorithm to find a fix point x of a function f, defined by f(x) = x (if the x values define a complete lattice and if f is monotonic, the sequence x_{n+1} = f(x_n) converges to the smallest fix point of f, by Tarski's theorem; here x is a control flow graph with input and output frames, f is a "merge" function, and the order relation is based on the subtyping relation of the verification type system). First the input frame I1 of the first basic block is computed from the method descriptor "public static m(Z)LA;", which gives I1 = [I] []. Then the first basic block is marked as "changed", and the fix point algorithm starts:
Iteration | Changed | Output frames | Input frames | Comment
---|---|---|---|---
0 | B1 | O1 = [?B] [?], O2 = [?A] [?], O3 = [??] [?], O4 = [??] [?] | I1 = [I-] [], I2 = [??] [?], I3 = [??] [?], I4 = [??] [?] | Initialization of input frame I1 from the method's descriptor, B1 marked as "changed"
1 | B3 | | I1 = [I-] [], I2 = [??] [?], I3 = [IB] [], I4 = [??] [?] | B1 marked as "unchanged", merge of I1 and O1 into I3 (B3 is a successor of B1), B3 marked as "changed"
2 | B2, B4 | | I1 = [I-] [], I2 = [IB] [], I3 = [IB] [], I4 = [IB] [] | B3 marked as "unchanged", merge of I3 and O3 into I2 (B2 is a successor of B3), B2 marked as "changed", merge of I3 and O3 into I4 (B4 is a successor of B3), B4 marked as "changed"
3 | B4, B3 | | I1 = [I-] [], I2 = [IB] [], I3 = [IA] [], I4 = [IB] [] | B2 marked as "unchanged", merge of I2 and O2 into I3 (B3 is a successor of B2), B3 marked as "changed"
4 | B3 | | I1 = [I-] [], I2 = [IB] [], I3 = [IA] [], I4 = [IB] [] | B4 marked as "unchanged"
5 | B2, B4 | | I1 = [I-] [], I2 = [IA] [], I3 = [IA] [], I4 = [IA] [] | B3 marked as "unchanged", merge of I3 and O3 into I2 (B2 is a successor of B3), B2 marked as "changed", merge of I3 and O3 into I4 (B4 is a successor of B3), B4 marked as "changed"
6 | B4 | | I1 = [I-] [], I2 = [IA] [], I3 = [IA] [], I4 = [IA] [] | B2 marked as "unchanged", merge of I2 and O2 into I3 (B3 is a successor of B2), I3 unchanged so B3 not marked as "changed"
7 | | | I1 = [I-] [], I2 = [IA] [], I3 = [IA] [], I4 = [IA] [] | B4 marked as "unchanged"
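The iterations above can be summarized by a worklist algorithm. Here is a compact sketch of it, assuming a Frame type with a mergeInto method that returns true when the merged-into frame changed; the real computeAllFrames combines the predecessor's input and output frames in the merge, since output frames are relative to input frames:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class FixPointSketch {
  interface Frame {
    // Merges this (output) frame into 'input'; returns true if 'input' changed.
    boolean mergeInto(Frame input);
  }

  static class Block {
    Frame inputFrame, outputFrame;
    List<Block> successors;
    boolean changed;
  }

  static void computeAllFrames(Block firstBlock) {
    Deque<Block> worklist = new ArrayDeque<>();
    firstBlock.changed = true; // its input frame comes from the method descriptor
    worklist.add(firstBlock);
    while (!worklist.isEmpty()) {
      Block block = worklist.remove();
      block.changed = false;
      // Propagate this block's output frame to its successors' input frames.
      for (Block successor : block.successors) {
        if (block.outputFrame.mergeInto(successor.inputFrame) && !successor.changed) {
          successor.changed = true;
          worklist.add(successor);
        }
      }
    }
  }
}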
3.5.2 Uninitialized types
The simulation of a NEW T instruction results in a special uninitialized type being pushed on the stack. This special type contains the offset of the NEW instruction that created it. When an INVOKESPECIAL instruction for a T constructor is simulated, all occurrences of this special type in the current stack map frame must be replaced with the normal T type.
The basic block input and output frame data structures presented in the previous section are not sufficient to take uninitialized types into account. Indeed, when a constructor invocation is visited, the type of the target may be unknown, and many local variable and operand stack types may also be unknown. It is therefore impossible to replace all occurrences of the target type everywhere in the stack map frame.
For this reason an additional data structure is associated with each basic block, namely the list of types that are initialized in this basic block (these types can be relative to the unknown input stack map frame of the basic block). This data structure is constructed during the visit of instructions, and is used in computeAllFrames, when all the types are known.
3.5.3 Exception handlers
For all the instructions covered by an exception handler, the control flow can jump to the exception handler block. This means that, inside the region covered by an exception handler, and as a consequence of the definition of basic blocks, basic blocks are reduced to individual instructions. In this case the advantage of using an algorithm working at the basic block level is lost, since there are as many basic blocks as instructions.
Fortunately it is not necessary to really use one basic block per instruction inside regions covered by exception handlers. This is because not all the frames associated with these instructions have an impact on the input frame of the exception handler. Indeed this input frame only contains local variable types, and its stack is reduced to a single element that depends only on the type of exception caught by this handler. As a consequence only the frames associated with the instructions that affect local variables are important. In practice, this means that, inside regions covered by exception handlers, xSTORE instructions end the current basic block (and start a new one) like, for instance, an IFEQ instruction would.
As an example, consider the following method:
public m(Ljava/lang/Integer;Ljava/lang/Float;)Ljava/lang/Number;
  TRYCATCHBLOCK L0 L1 L1 java/lang/Exception
  ACONST_NULL
  ASTORE 3
L0
  ALOAD 1
  ASTORE 3
  ALOAD 2
  ASTORE 3
  ALOAD 3
  ARETURN
L1
  ASTORE 4
  ALOAD 3
  ARETURN
Normally, due to the exception handler, each instruction between L0 and L1 should be a distinct basic block, which would give 6 basic blocks inside this region. In practice, thanks to the above optimization, only the ASTORE instructions change the current basic block, which gives 3 basic blocks (ALOAD 1 ASTORE 3, ALOAD 2 ASTORE 3 and ALOAD 3 ARETURN). Note that if the instructions between L0 and L1 were considered as a single basic block, the frame computed for L1 would be incorrect: it would indeed be the same as the output frame of the previous block, where local variable 3 is of type Float (whereas the correct value is the common super type of Integer and Float, i.e. Number).
Note: the edges of the control flow graph that connect basic blocks to exception handler blocks are not constructed in the visitTryCatchBlock method but in the computeAllFrames method. Indeed, since the visitTryCatchBlock method must be called before the labels passed as arguments to this method are visited, it is not possible to know in this method which labels belong to the range protected by this handler.
3.5.4 Dead code
The fix point algorithm used in the second step of the algorithm described in section 3.5.1 is limited to reachable code. Indeed, by definition, reachable code is code that can be reached from the initial basic block in the control flow graph, and the fix point algorithm is precisely looking at these blocks. As a consequence the algorithm does not compute the input frame of dead basic blocks.
Unfortunately the Java 6 split verifier requires a stack map frame for every basic block, even unreachable ones. The problem, as shown above, is that these frames are not computed and, even worse, cannot be computed. Indeed an unreachable basic block can contain illegal bytecode sequences, such as ISTORE 1 ALOAD 1 (more precisely this was possible with the JVM verifier prior to Java 6, but is no longer possible with the new verifier).
The consequence of all this is that dead code must either be removed or replaced with a valid bytecode sequence whose stack map frame can be easily computed. The first solution was judged too complicated. So the second solution has been chosen.
A simple solution is to replace dead code with NOP instructions. In this case any stack map frame will be valid for these blocks. The only problem is that execution can fall from the end of a dead code block into the end of the method or the start of a reachable block. So either the stack map frame for the dead code block must be consistent with the frame of the next block, or the last instruction of the dead code block must not be replaced with a NOP but with an instruction without successor, such as RETURN, GOTO, or ATHROW.
The first solution is too complicated; the second one is possible, but the fact that the dead code block can be reduced to a single byte must be taken into account: there is not always enough room to replace it with NOP instructions followed by a GOTO, for example. xRETURN can be used (it is a single byte instruction), but this requires adjustments to the method's return type. The ATHROW instruction does not have this problem, and is also a single byte instruction. It was therefore chosen to end the dead code blocks.
Note that it is not necessary to insert instructions to create something on the stack before ATHROW: the stack map frame for this block can "declare" that, when the execution of this basic block begins, there is already an exception on the stack (there will be no consistency problem with other frames since this block is unreachable; also there is no problem with declared exceptions, because the verifier does not check that ATHROW instructions are consistent with declared exceptions).
But declaring a stack map frame of the form [] [java/lang/Throwable] for dead basic blocks may cause a problem if this block is in the range of an exception handler: the stack map frame for this handler may be inconsistent, with a non empty list of local variable types (the stack part will always be consistent since it always has the form [exception type] for an exception handler). To solve this, we remove the dead basic blocks from the exception handler ranges (which may remove a handler, decrease its range, or split its range in two sub ranges).
In summary, the solution for dead code blocks is to replace these blocks with NOP ... NOP ATHROW, to declare a [] [java/lang/Throwable] stack map frame for these blocks, and to remove the dead code blocks from the exception handlers.
3.5.5 Subroutines
The JSR and RET instructions, used for subroutines, complicate the control flow and data flow analysis algorithms. Fortunately they must not be used in Java 6 classes, so they do not have any impact on the algorithm used to compute stack map frames. But they do complicate the algorithm used to compute the maximum stack size, with the COMPUTE_MAXS option. More precisely they do not impact the algorithm used to compute the maximum stack size from the control flow graph, but they complicate the construction of this graph. As for exception handlers, the edges in the control flow graph that correspond to subroutines are computed in computeMaxStackAndLocal.
The first step consists in finding, for each basic block, to which subroutine it "belongs". A basic block may be reached from several subroutines, but we say that it "belongs" to the "oldest" of these subroutines (with the convention that a subroutine calling another one is "older" than the callee). We compute the owner subroutine of each basic block as follows (a sketch of this step is given after the list):
- visit all the basic blocks that are reachable from the first one, without following a JSR or RET instruction, and mark them as belonging to a main "subroutine". In the process, push all the targets of the visited JSR instructions into a queue Q.
- while the queue Q is not empty, pop a JSR target from it, visit all the basic blocks reachable from this JSR target (without following a JSR or RET instruction), and mark them as belonging to the same, new subroutine, if they are not already marked as belonging to a previous subroutine. In the process, push all the targets of the visited JSR instructions into Q.
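Here is a compact sketch of this first step, under simplified assumptions: each block directly exposes its non-JSR/RET successors and the target of its JSR instruction, if any (the real implementation derives this information from Edge objects):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class SubroutineOwnerSketch {
  static class Block {
    List<Block> successors; // edges that do not follow a JSR or RET
    Block jsrTarget;        // target of a JSR in this block, or null
    int subroutine;         // 0 = not yet marked
  }

  static void markSubroutines(Block firstBlock) {
    Deque<Block> jsrTargets = new ArrayDeque<>();
    int subroutineId = 1;
    // The main code is treated as subroutine #1.
    markReachable(firstBlock, subroutineId, jsrTargets);
    while (!jsrTargets.isEmpty()) {
      Block target = jsrTargets.remove();
      if (target.subroutine == 0) {
        markReachable(target, ++subroutineId, jsrTargets);
      }
    }
  }

  // Marks all blocks reachable from 'start' (without following JSR or RET)
  // as belonging to 'subroutineId', unless already marked, and enqueues the
  // targets of the JSR instructions found along the way.
  static void markReachable(Block start, int subroutineId, Deque<Block> jsrTargets) {
    Deque<Block> stack = new ArrayDeque<>();
    stack.push(start);
    while (!stack.isEmpty()) {
      Block block = stack.pop();
      if (block.subroutine != 0) continue; // already owned by an older subroutine
      block.subroutine = subroutineId;
      if (block.jsrTarget != null) jsrTargets.add(block.jsrTarget);
      for (Block successor : block.successors) stack.push(successor);
    }
  }
}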
The second step consists in finding, for each RET instruction, to which basic blocks it can return. The possible basic blocks are those that follow JSR instructions. We therefore examine each JSR instruction in turn. For each such JSR instruction I we look at the basic blocks that belong to the called subroutine. When a RET instruction is found we add the block following I as a successor of the RET instruction (except if I and the RET instruction belong to the same subroutine, which can happen if a subroutine returns to its parent subroutine without a RET).
Let's take an example to illustrate this:
public m(Z)V
L0
  JSR L2
L1
  RETURN
L2
  ASTORE 2
  JSR L4
L3
  GOTO L5
L4
  ASTORE 3
  ILOAD 1
  IFEQ L5
  RET 3
L5
  RET 2
After all the instructions have been visited in MethodWriter, and just before computeMaxStackAndLocal is called, the control flow graph is the following (L1 and L3 are not real successors of L0 and L2: these edges are only used to keep track of JSR's return addresses, and are ignored during the control flow graph analysis used to compute the maximum stack size):
- L0 successors: L2, L1
- L1 successors: none
- L2 successors: L4, L3
- L3 successors: L5
- L4 successors: L5
- L5 successors: none
As explained above, the first step in computeMaxStackAndLocal consists in finding, for each basic block, to which subroutine it belongs. The blocks that are reachable from L0 without following a JSR or a RET are L0 and L1: they are marked as belonging to subroutine #1, and the JSR target L2 is enqueued. L2 is then removed from the queue, and the blocks that are reachable in the same way from L2, namely L2, L3 and L5, are marked as belonging to subroutine #2. In the process the JSR target L4 is enqueued. Finally L4 is removed from the queue, and the blocks that are reachable in the same way from L4, and not already marked, are marked as belonging to subroutine #3. Here this means that only L4 is marked, because L5 was already marked as belonging to subroutine #2 (this happened because the nested subroutine #3 can "return" to its parent subroutine #2 without a RET):
- L0 belongs to: subroutine #1
- L1 belongs to: subroutine #1
- L2 belongs to: subroutine #2
- L3 belongs to: subroutine #2
- L4 belongs to: subroutine #3
- L5 belongs to: subroutine #2
The second step consists in finding the successors of the RET instructions. As explained above, this is done by examining each JSR instruction in turn. The first one, in L0, leads to the analysis of the basic blocks that belong to subroutine #2. In these basic blocks we find only one RET instruction, in L5. These JSR and RET instructions do not belong to the same subroutine, so we add L1 (the next basic block after the JSR in L0) as a successor of L5. The second JSR instruction, in L2, leads to the analysis of the basic blocks of subroutine #3. In these basic blocks we find one RET instruction, in L4. L2 and L4 do not belong to the same subroutine, so we add L3 (the next basic block after the JSR in L2) as a successor of L4.
The final control flow graph is the following:
- L0 successors: L2, L1
- L1 successors: none
- L2 successors: L4, L3
- L3 successors: L5
- L4 successors: L5, L3
- L5 successors: L1
Note: you may have noticed that the L4 "basic block" is not a real basic block, because it contains several instructions that can lead to other blocks. In fact it is the union of two consecutive basic blocks. This is not an error: with the COMPUTE_MAXS option it is not always necessary to decompose the control flow graph down to individual basic blocks. Therefore, when possible, several consecutive basic blocks are represented as a single block, for optimization purposes.
4 Code optimizations
The main objective of ASM is to produce the smallest and fastest possible code, while keeping a reasonably "clean" public API. This section explains the techniques used in this project to achieve these objectives (many of them are not recommended for mainstream development).
4.1 Optimizing performance
The first and most important step to get good performance is to use the right API and good algorithms [R0]. Sometimes it is hard to decide which API or algorithm is best without actually implementing and testing all the options. In this case, take the time to implement and test all options to find the best one. There are however some general rules that can be used when designing an API and when implementing it, in order to get good performance (a side effect of these rules is to reduce code size):
- [R1] the API must stay very close to the internal class file structures, in order to avoid costly conversions during parsing and writing. Many examples of this rule can be seen in the existing API: internal class names, type descriptors and signatures are passed as arguments to visit methods in the same form as they are stored in the class file. Stack map frames are also visited as they are stored in the class file, i.e. in a compressed form (although there is an option to uncompress them). The drawback of this rule appears when several class adapters need a high level view of these encoded structures (for example a SignatureVisitor based view of a signature string): in this case, the decoding and encoding steps are executed in each adapter, while they could be executed only once in ClassReader and ClassWriter.
- [R2] the implementation must not include any check or verification, at any level (class file parsing, preconditions of methods, validity of bytecode sequences, etc). These verifications must be added in CheckXxxAdapter classes.
Once the best API and algorithms have been found, several "low level" techniques can be used to optimize performance:
- [R3] avoid string manipulation operations. These operations generally have a high cost. Examples of this can be seen in ClassReader: the cpInfoValues array is used to avoid parsing and building strings several times for the same constant pool UTF8 item. Another example is in the Label class: the types used for stack map frames are encoded as int values instead of strings (they were initially stored as strings; the change to int values improved performance a lot).
- [R4] avoid array copy operations. Several examples of this principle can be seen in the existing implementation. For example the toByteArray method first computes the size of the class, then allocates a ByteVector of this size, and finally writes the class content in this vector. This avoids many calls to the enlarge method in ByteVector, and therefore many array copy operations.
- [R5] avoid getter and setter methods, at least for non public or protected fields. Some JVMs do not inline these methods, in which case accessing fields directly is faster than using getters and setters. This also saves code. The core package does not use any such method. The tree package also exposes many fields directly.
- [R6] copy frequently accessed instance fields to local variables to reduce field access operations (this is the logical continuation of the previous technique). See for example the ByteVector class: the length and data fields are copied into local variables in methods that access them several times. A variant of this rule is to replace fields with method parameters (this variant was used in the signature package - see section 4.3).
- [R7] cache the result of complex functions in order to execute them only once. This rule is used, for example, in the visitMethod method in MethodWriter: the result of getArgumentsAndReturnSizes for a method Symbol is cached in a lazily computed field of Symbol. It is also used in ClassReader: the cpInfoValues array is used as a cache to avoid parsing the same string constant pool item several times.
- [R8] use int coded data structures in order to replace slow data structure access methods with fast arithmetic or bitwise operations (this also reduces code size). Several examples of this rule can be seen in the existing code: the boolean stack encoded as an int in SignatureWriter, the abstract types encoded as int in Label, etc.
- [R9] use switch statements that can be compiled into TABLESWITCH instead of LOOKUPSWITCH (the first opcode executes faster), where possible (this requires switch case constants that are not sparse).
A toy illustration of rules [R6] and [R8] is given below.
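In this sketch the class and method names are invented for the example; only the two techniques themselves come from the rules above:

class MicroOptimizationSketch {
  private byte[] data = new byte[16];
  private int length;
  private int booleanStack; // [R8]: bit i holds the i-th stacked boolean

  // [R6]: 'data' and 'length' are copied into locals, accessed several times,
  // and written back once, instead of repeated field accesses.
  void putTwoBytes(int b1, int b2) {
    int currentLength = this.length;
    byte[] currentData = this.data;
    if (currentLength + 2 > currentData.length) {
      currentData = java.util.Arrays.copyOf(currentData, currentData.length * 2);
      this.data = currentData;
    }
    currentData[currentLength++] = (byte) b1;
    currentData[currentLength++] = (byte) b2;
    this.length = currentLength;
  }

  // [R8]: push and pop on the int-encoded boolean stack with bit operations,
  // instead of using a java.util collection.
  void pushBoolean(boolean value) {
    booleanStack = (booleanStack << 1) | (value ? 1 : 0);
  }

  boolean popBoolean() {
    boolean value = (booleanStack & 1) != 0;
    booleanStack >>>= 1;
    return value;
  }
}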
These techniques must be used with care (because they can make code harder to understand and to maintain) and only when they bring a real performance benefit. To see whether this is the case, the performance of every proposed optimization must be compared to the performance of the current code (if possible on several JVMs, in order to exclude singularities due to specific JVM optimization strategies).
Of course optimizations must be applied in priority to the code that is responsible for the largest part of the total execution time, in order to get the best performance improvements. To detect these "hotspots", a code profiler can be used. It is also possible to turn some code sections on or off, in order to measure, by difference with the normal case, the cost of that section.
4.2 Optimizing code size
The obvious way to reduce code size is to reduce the number of classes and methods! This begins by reducing the number of abstract classes in the public API (not because they are big, but because it indirectly reduces the number of classes and methods) [R10]. This reduction generally requires compromises with what a "clean" object oriented API would be. Whether these compromises are worthwhile or not must be decided on a case by case basis. Several such compromises can be seen in the current public ASM API. For example the AnnotationVisitor abstract class is used both for annotations and annotation arrays although, in the second case, the name argument is useless and would not have been used if an AnnotationArrayVisitor abstract class were defined. Another example is the SignatureVisitor abstract class: a single class is used to represent a quite complex recursive data structure that, in a "normal" object oriented design, would be represented with up to 8 abstract classes.
The number of classes can be reduced in several ways. The most "elegant" one is to use appropriate design patterns [R11]. For instance the ClassReader class needs Attribute factories in order to create attribute objects for the class, method and field attributes. But instead of using the Factory design pattern, which requires two classes, ASM uses the Prototype pattern, which uses only one class.
The number of classes can also be reduced by merging several related classes into a single one [R12]. For instance a single Symbol class is used to represent all the constant pool item types, and the Label class is used to represent both bytecode offsets and basic blocks. The drawback of this approach is that it increases memory requirements (an integer Symbol object, for example, uses 7 fields while an IntegerSymbol class would require only 2 fields) and object instantiation times (the Label and Frame classes were initially merged but have been separated in order to improve performance).
The number of methods can be reduced by inlining methods that are called at only one place in the code (a method that is called from several places can also be inlined if the client code can be refactored to call it at only one place) [R13]. Another benefit of this technique is that it improves performance.
4.3 An example
This section illustrates the previous optimization techniques on a real example, namely the asm.signature package. The goal of this package is to provide an event based API, a parser and a writer for the generics signature grammar, given below:
ClassSignature:
FormalTypeParameters? ClassTypeSignature ClassTypeSignature*
FormalTypeParameters:
< FormalTypeParameter+ >
FormalTypeParameter:
Identifier FieldTypeSignature? FieldTypeSignature*
FieldTypeSignature:
ClassTypeSignature | ArrayTypeSignature | TypeVariableSignature
ClassTypeSignature:
L Identifier ( / Identifier )* TypeArguments? ( . Identifier TypeArguments? )* ;
TypeArguments:
< TypeArgument+ >
TypeArgument:
* | ( ( + | - )? FieldTypeSignature )
ArrayTypeSignature:
[ TypeSignature
TypeVariableSignature:
T Identifier ;
TypeSignature:
Z | C | B | S | I | F | J | D | FieldTypeSignature
MethodTypeSignature:
FormalTypeParameters? ( TypeSignature* ) ( TypeSignature | V ) ( ^ClassTypeSignature | ^TypeVariableSignature )*
The first design of this package used 7 abstract classes, roughly corresponding to the grammar non terminals:
ClassSignatureVisitor
FormalTypeParameterVisitor
FieldTypeSignatureVisitor
ClassTypeSignatureVisitor
TypeArgumentsVisitor
TypeSignatureVisitor
MethodTypeSignatureVisitor
The parser was generated with JavaCC and the writer used 7 classes, one per abstract class.
The second step was to replace the generated parser with a hand written, recursive descent parser, containing 11 recursive methods (one per rule of the grammar). This led to major performance improvements, and to a huge code size reduction. This was due to the fact that hand written code can be more efficient than generated code, and also to the fact that all lexical and syntactic verifications were removed ([R2]).
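To give an idea of what such a hand written parser looks like, here is a hypothetical fragment for the TypeSignature rule only; the real SignatureReader differs in many details (and, per [R2], performs no validity checks either):

class TypeSignatureParserSketch {
  interface Visitor {
    void visitBaseType(char descriptor);
    void visitTypeVariable(String name);
    Visitor visitArrayType();
    void visitClassType(String name);
    void visitEnd();
  }

  // Parses the TypeSignature starting at 'start' and returns the index of the
  // first unparsed character ([R6]: the position is a parameter, not a field).
  static int parseTypeSignature(String signature, int start, Visitor v) {
    char c = signature.charAt(start);
    switch (c) {
      case 'Z': case 'C': case 'B': case 'S':
      case 'I': case 'F': case 'J': case 'D':
        v.visitBaseType(c);
        return start + 1;
      case '[': // ArrayTypeSignature: [ TypeSignature
        return parseTypeSignature(signature, start + 1, v.visitArrayType());
      case 'T': { // TypeVariableSignature: T Identifier ;
        int end = signature.indexOf(';', start);
        v.visitTypeVariable(signature.substring(start + 1, end));
        return end + 1;
      }
      default: { // ClassTypeSignature (type arguments omitted in this sketch)
        int end = signature.indexOf(';', start);
        v.visitClassType(signature.substring(start + 1, end));
        v.visitEnd();
        return end + 1;
      }
    }
  }
}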
In the third step a refactoring reduced the number of writer classes to 6 instead of 7 (by using inheritance).
In the fourth step the number of abstract classes was reduced to 4 ([R10]), and the 6 writer classes were merged into a single one ([R12]), using a boolean stack encoded as an int ([R8]). The parser was mostly unchanged. The reduced API allowed invalid signatures to be generated (it corresponded to a generalized grammar where FieldType, ClassType, ArrayType, TypeVariable and Type are merged into a single Type non terminal), but this was seen as an acceptable compromise.
In the fifth step the parser was optimized by replacing some fields that represented the last parsed character and its index with method arguments ([R6]). In addition some methods were inlined, which removed 6 of the 11 parsing methods ([R13]).
Finally, after some feedback from users, the API was reduced to a single abstract class (although this allowed even more invalid signatures to be generated, it was judged more practical for defining SignatureAdapters), and this provided more opportunities to inline parser methods: the final parser contains only 3 parsing methods.
The code size decreased from 22KB at the first step, to 6.8KB at the second step, 4.5KB at the fourth step, and less than 4KB at the last step. At the same time the average time to parse and rebuild a signature decreased from 74 microseconds at the first step, to 15 microseconds at the second step, 5.4 microseconds at the fourth step, and less than 5 microseconds at the last step.
As can be seen from this example (more details can be found on the ASM mailing list archives of January 2005), improving performances often leads to code size reductions, and vice versa!