Introduction
I have been spending time benchmarking aspects of our JVM profiling agent looking for possible areas of improvement. The most expensive call our agent makes is the collection of the Java thread call stack. This agent call is made when a JXInsight trace extension invokes Tracer.start(). On a typical developer workstation the cost can range from 5 microseconds to 50 microseconds depending on the processor speed, the JVM implementation, and the depth of the thread call stack.

To determine whether the JXInsight native agent could be improved I created a simple benchmark test to compare JXInsight's Tracer API with the standard Java API's provided in Java 2 (1.4.2) and Java 5 (1.5.0) that allow access to a thread's call stack. In Java 2 the stack trace can be obtained by calling new Exception().getStackTrace() and in Java 5 we now have Thread.currentThread().getStackTrace().

Below is the simple benchmark test driver I created that uses JXInsight's Tracer API to record high resolution clock times.




The Results
The test driver was run with 4 different JVM implementations on a Windows 2000 workstation with a single P4 1.6 Ghz CPU. The lowest average wall clock time per call at 6 microseconds was recorded with Tracer.getThreadCallStack(ARRAY) run with the BEA JRockit 1.4.2 JVM. The highest average wall clock time per call at 1487 microseconds was recorded with Thread.currentThread().getStackTrace() run with the BEA JRockit 1.5.0 JVM.


IMPORTANT UPDATE
Very soon after the publication of this Insight article the BEA Rockit Team contacted us and where kind enough to give JInspired access to an early release of JRockit Java 5 (1.5.0.04) scheduled for public release some time in Nov 2005. The new figures show a significant performance improvement. It is important to note that the Java 5 API call, Thread.currentThread().getStackTrace(), is not typical for any normal Java business application.





The following chart breaks down the JRockit 1.4.2 JVM wall clock times further into GC and CPU time. We can see from this chart that the average wall clock times for the test case new Exception().getStackTrace() is impacted significantly by excessive GC events occurring during its execution as a result of its repeated temporary object allocation (StackTraceElements). The GC clock time is approximately 60% of the execution wall clock time. The thread CPU time metric in the chart confirms the impact of the GC collection thread on the main test execution thread (a non-concurrent garbage collection implementation).





The Problem
The benchmark test is not only measuring the speed of the actual Java API but also the JVM's garbage collector. Indeed the new Exception().getStackTrace() test does create excessive temporary object allocations but I hope you would agree that the wall clock time measurement is invalidated by such a large proportion. Still not convinced? Well consider the wall clock time measurements reported today by other Java profiling application performance management tools. These measurements are recorded in JVMs with high levels of concurrent thread execution, each thread creating significant amounts of temporary object allocation during request processing that has a post execution performance impact on other transaction running or starting. Not recording GC measurements for every wall clock time measurement could easily have performance engineering and development teams chasing red herrings.

I am quite confident that most performance test engineers have at one time being baffled by a profiling tool reporting high wall clock times for a method with a small number of sources lines with limited outgoing method invocations. This is further compounded by the fact that most profiling tools require you to manually switch between wall clock times and CPU times (you cannot have both). The occurrence of a GC collection event is effectively outside the control of any code block measured and hence should always be recorded and discounted from wall clock times. Discounting GC does not mean it is of no concern to the performance engineer - excessive GC events are a system level performance issue that has the potential to impact all user transactions.


PERFORMANCE APPLICATION MANAGEMENT TIP
When chasing memory leakages and GC problems a good starting point is the transactions, components, and classes creating large object allocations. JXInsight can be dynamically configured to record object allocation byte sizes for tracing and transaction intervals. This has much less performance impact than other memory profiling solutions. Please read this previous Insight article about an approach to a special case of memory leakage (capacity) for production systems.


The Solution
JXInsight has an innovative application performance management approach in that it not only records wall clock times and CPU times (together) but also GC times, thread monitor blocking, thread monitor waiting and wall clock adjusted (wall clock time less the sum of GC and thread monitor times). The recording options can be configured at agent load time via -Xrunjdbinsight:g=t or via the JXInsight console menu or terminal commands.

The following bar chart displays the object allocation byte sizes for a run of the benchmark with the Java 5 (1.5.0) JVM's from Sun and BEA JRockit this time with the JXInsight agent configured to collect only object allocations byte sizes (-Xrunjdbinsight:g=f,a=t). From this chart is very clear why the JRockit Java 5 times are slow: excessive (temporary) object allocations.





William Louth, JXInsight's Product Architect