Introduction
I have been spending time benchmarking aspects
of our JVM profiling agent looking for possible
areas of improvement. The most expensive call
our agent makes is the collection of the Java
thread call stack. This agent call is made when
a JXInsight trace extension invokes
Tracer.start().
On a typical developer workstation the cost
can range from 5 microseconds to 50 microseconds
depending on the processor speed, the JVM implementation,
and the depth of the thread call stack.
To determine whether the JXInsight native agent
could be improved I created a simple benchmark
test to compare JXInsight's
Tracer API with the standard Java API's provided in
Java 2 (1.4.2) and Java 5 (1.5.0) that allow
access to a thread's call stack. In Java 2 the
stack trace can be obtained by calling
new
Exception().getStackTrace() and in Java
5 we now have
Thread.currentThread().getStackTrace().
Below is the simple benchmark test driver I
created that uses JXInsight's Tracer API to
record high resolution clock times.
The Results
The test driver was run with 4 different JVM
implementations on a Windows 2000 workstation
with a single P4 1.6 Ghz CPU. The lowest average
wall clock time per call at
6
microseconds was recorded with
Tracer.getThreadCallStack(ARRAY) run
with the BEA JRockit 1.4.2 JVM. The highest
average wall clock time per call at
1487
microseconds was recorded with
Thread.currentThread().getStackTrace() run with the BEA JRockit 1.5.0 JVM.
|
IMPORTANT
UPDATE
Very
soon after the publication of this Insight article the BEA
Rockit Team contacted us and
where kind enough to give JInspired access to an early release of JRockit
Java 5 (1.5.0.04) scheduled
for public release some time in Nov
2005. The
new figures show a significant performance
improvement. It is important to note
that the Java 5 API call, Thread.currentThread().getStackTrace(),
is not typical for any normal Java business
application.
|

The following chart breaks down the JRockit
1.4.2 JVM wall clock times further into GC and
CPU time. We can see from this chart that the
average wall clock times for the test case
new
Exception().getStackTrace() is impacted
significantly by excessive GC events occurring
during its execution as a result of its repeated
temporary object allocation (
StackTraceElements).
The GC clock time is approximately
60% of the execution wall clock time. The thread
CPU time metric in the chart confirms the impact
of the GC collection thread on the main test
execution thread (
a non-concurrent garbage
collection implementation).
The Problem
The benchmark test is not only measuring
the speed of the actual Java API but also the
JVM's garbage collector. Indeed the
new
Exception().getStackTrace() test does create
excessive temporary object allocations but I
hope you would agree that the wall clock time
measurement is invalidated by such a large proportion.
Still not convinced? Well consider
the wall clock time measurements reported today
by other Java profiling application performance
management tools. These measurements are recorded
in JVMs with high levels of concurrent thread
execution, each thread creating significant
amounts of temporary object allocation during
request processing that has a
post execution performance impact on
other transaction running or starting. Not recording
GC measurements for every wall clock time measurement
could easily have performance engineering and
development teams chasing red herrings.
I am quite confident that most performance test
engineers have at one time being baffled by
a profiling tool reporting high wall clock times
for a method with a small number of sources
lines with limited outgoing method invocations.
This is further compounded by the fact that
most profiling tools require you to manually
switch between wall clock times and CPU times
(
you cannot have both). The occurrence
of a GC collection event is effectively outside
the control of any code block measured and hence
should always be recorded and discounted from
wall clock times. Discounting GC does not mean
it is of no concern to the performance engineer
- excessive GC events are a system level performance
issue that has the potential to impact all user
transactions.
PERFORMANCE
APPLICATION MANAGEMENT TIP
When
chasing memory leakages and GC problems
a good starting point is the transactions,
components, and classes creating large
object allocations. JXInsight can be
dynamically configured to record object
allocation byte sizes for tracing and
transaction intervals. This has much
less performance impact than other memory
profiling solutions. Please read this
previous Insight article about an
approach to a special case of memory
leakage (capacity) for production
systems.
|
The Solution
JXInsight has an innovative application performance
management approach in that it not only records
wall
clock times and
CPU times (
together) but also
GC times, thread monitor
blocking,
thread monitor
waiting and
wall clock
adjusted (
wall clock time
less the sum of GC and thread monitor times).
The recording options can be configured at agent
load time via -Xrunjdbinsight:g=t or via the
JXInsight console menu or terminal commands.
The following bar chart displays the object
allocation byte sizes for a run of the benchmark
with the Java 5 (
1.5.0) JVM's from
Sun and BEA JRockit this time with the JXInsight
agent configured to collect only object allocations
byte sizes (
-Xrunjdbinsight:g=f,a=t).
From this chart is very clear why the JRockit
Java 5 times are slow: excessive (
temporary)
object allocations.
William
Louth, JXInsight's Product Architect