Reality, Reactivity, Relevance and Repeatability in Java Application Profiling

“Instrument error refers to the combined accuracy and precision of a measuring instrument, or the difference between the actual value and the value indicated by the instrument (error)…a way to deal with instrument error may be to reduce the reactivity of the system to being measured…reactivity is a phenomenon that occurs when individuals (systems) alter their performance or behavior due to the awareness that they are being observed….accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity’s actual (true) value…precision of a measurement system, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results.” – Wikipedia

This article was written in response to a message thread in the LinkedIn “Java Performance Professional” group in which the effectiveness of instrumentation-based Java profiling solutions was called into question for profiling well-tuned systems, including high-frequency, low-latency message processing platforms. Here is a comment from a performance analyst at a very big and blue IT management software and services company:

“On a sufficiently large java application, many profilers either have prohibitively large overhead; some simply cannot handle the volume of data.”

Every time a performance optimization is proposed and temporarily applied to our codebase, we go through a series of evaluation tests and questions that need to be answered before deciding whether the change stays and gets published. First we run a number of micro-benchmarks to see whether a reduction in our measurement overhead can indeed be observed, and whether this reduction is present in both single-threaded and multi-threaded benchmark runs. We also look to see whether the optimization is fair to all measurement points and flows (paths). We need to understand whether it introduces any hotspot bias, especially as our metering engine is highly adaptive and sensitive to measurement changes that impact the degree of “inherent” (self) metering assigned to probes (methods); it is this very aspect which makes our solution effective, efficient and exact. We then move on to more realistic, and slightly less repeatable, benchmarks in which we try to determine how much of an improvement the optimization achieves in reality. Part of this effort involves comparing performance models (snapshots) taken with previous versions of the metering engine software against the new, and hopefully more optimized, version. We use various internal tools, built on top of our Probes Snapshot Open API, to access and compare the data, but sometimes simply opening up both snapshots and looking at the visualizations can already instill some degree of confidence; accepting that most realistic benchmarks are never entirely reproducible, nor highly adaptive systems entirely predictable. Below is a screenshot comparison of two snapshots created by different releases of our metering runtime. There is a great deal of similarity in the visualizations.
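The first step above can be sketched in code. The following is a minimal, hypothetical micro-benchmark (not the actual JXInsight harness) that estimates the per-call cost of a timing probe by comparing a probed call against a bare call, first single-threaded and then across four threads:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: estimate probe overhead per call, single- and multi-threaded.
public class OverheadBench {
    static final LongAdder sink = new LongAdder();    // defeats dead-code elimination
    static final LongAdder metered = new LongAdder(); // accumulated probe readings

    static void work() { sink.increment(); }

    static void probedWork() {                        // stand-in instrumentation probe
        long start = System.nanoTime();
        work();
        metered.add(System.nanoTime() - start);
    }

    static long time(Runnable r, int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) r.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        final int calls = 2_000_000;
        time(OverheadBench::work, calls);             // warm up both paths so
        time(OverheadBench::probedWork, calls);       // HotSpot compiles them
        long bare = time(OverheadBench::work, calls);
        long probed = time(OverheadBench::probedWork, calls);
        System.out.printf("single-thread probe overhead: ~%d ns/call%n",
                          Math.max(0, probed - bare) / calls);

        // Repeat the probed run across 4 threads to check contention effects.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        long t0 = System.nanoTime();
        for (int t = 0; t < 4; t++)
            pool.submit(() -> time(OverheadBench::probedWork, calls));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.printf("4-thread probed run: %d ms%n",
                          (System.nanoTime() - t0) / 1_000_000);
    }
}
```

Comparing the single-threaded delta with the multi-threaded run time is what reveals contention introduced by the metering change itself.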


Of course this type of comparison does not mean the measurement solution is any more accurate. It just means that the measurement error, which can be systematic or random, is consistent (or stable). So occasionally it is good practice to compare your measurement results with results generated by other performance measurement solutions and techniques. Whilst this sounds simple, in practice the reality is very different; in fact you face the prospect of multiple realities, and then of deciding which one is the more accurate (or less wrong) version. Before proceeding any further, ask yourself when was the last time you used more than one performance profiling solution to determine the validity of a profiler's results. If you have been fortunate enough to always get called in to identify “low hanging fruit”, then this thought might never have crossed your mind. But if you have been tasked with measuring and tuning existing, well-optimized software, you might have opened your eyes to the fact that all measurements invariably have some degree of error (inaccuracy and imprecision); some more than others.

Note: In our Scala compiler case study this aspect of performance measurement accuracy (and relevance) was tackled through the creation of multiple performance models each one with different strengths and weaknesses.

In the rest of the article we compare the results generated by our adaptive metering engine with those of two other popular but largely non-adaptive Java profiling solutions. But first let's present the steps needed to run the benchmark that was used to generate the results presented.

1. Download the DataStax Community Edition of Apache Cassandra. We used version 1.1.6 though 1.2 is now available.

2. Create a file named benchmark.cli in the top directory of the Cassandra install with the following contents.

drop keyspace Keyspace1;

3. Create a shell script file in the same directory as the benchmark.cli file with the following script code.

for i in {1..20}
do
  ./bin/cassandra-cli -f benchmark.cli
  sleep 5
  ./tools/bin/cassandra-stress -o insert -t 1 -n 500000
done

4. Open up a terminal window and launch the Apache Cassandra server with ./bin/cassandra

5. Open up a new terminal window and start the benchmark by running the script. You should see the following printed out in 20 repeated cycles.


6. To run the Apache Cassandra server with a profiling agent, restart the server with the JVM_OPTS environment variable set. Here is how this is done with JXInsight/OpenCore.

export JVM_OPTS="-agentpath:/OpenCore/bin/osx-32/libjxinsight.jnilib=prod -javaagent:/OpenCore/bundle/java/default/opencore-ext-aj-javaagent.jar -server"

We performed 4 benchmark runs. The first was without any profiling agent installed, which we have labeled as the baseline. We used the latest version of JXInsight/OpenCore as well as the latest releases of JP and NBP. For each profiler we limited method level instrumentation to the following packages:


For NBP we unchecked the option that limited thread profiling to 32 threads as the Apache Cassandra server creates a large number of threads in processing the requests generated by the stress tool.

For JXInsight/OpenCore we created a jxinsight.override.config file in the same directory as the script file with the following contents. Note we measure and report in microseconds.

# changed default hotspot threshold from 50 to 10
# changed default hotspot inherent threshold from 5 to 1

All profilers used instrumentation-based measurement of clock time, with as many extraneous options turned off as possible. After trying to run the benchmark with JP and NBP we had to increase the sleep time in the script to 20 and 30 seconds respectively, because the Apache Cassandra server would generate an error if it had not finished the drop keyspace command before the stress tool was run. We also had to change the for loop for both JP and NBP, dropping the count from 20 to 4, because of the long duration of each benchmark cycle.

From the results output by the stress tool we took a sample of the top N throughput results in each stress cycle. Here is a statistical summary.
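The sampling described above can be sketched as follows; the class name, method, and numbers here are invented for illustration, not taken from our actual tooling:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: summarize the top-N throughput samples of one stress cycle.
public class Summary {
    static DoubleSummaryStatistics topN(List<Double> cycleSamples, int n) {
        return cycleSamples.stream()
                           .sorted(Comparator.reverseOrder()) // best throughput first
                           .limit(n)                          // keep only the top N
                           .collect(Collectors.summarizingDouble(d -> d));
    }

    public static void main(String[] args) {
        // Made-up interim rates (ops/s) from one stress cycle.
        List<Double> cycle = Arrays.asList(8120.0, 9050.0, 8870.0, 7400.0, 9210.0);
        DoubleSummaryStatistics s = topN(cycle, 3);
        System.out.printf("max=%.0f min=%.0f avg=%.1f%n",
                          s.getMax(), s.getMin(), s.getAverage());
    }
}
```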


So how exactly did JXInsight/OpenCore outperform the other solutions by such a huge margin? Well, just like the JVM HotSpot compiler, our metering engine is adaptive and uses the measurement data to decide whether to continue measuring a particular probe. It also moves the instrumentation and measurement up to the most appropriate point in the call chain from which to measure the software execution; it does this very intelligently, ensuring that what continues to be metered is the most relevant. Here is a graphic that can help in understanding this process. If you want more background information please check out this article.
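The adaptive idea can be sketched in a few lines. This is a hypothetical illustration of the mechanism, not OpenCore's implementation: a probe keeps measuring until its average cost falls below a threshold, after which it turns itself off and its residual time is attributed to its caller's measurement.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of an adaptive probe that disables itself when cheap.
public class AdaptiveProbe {
    final long thresholdNanos;                  // e.g. a 10 µs hotspot threshold
    final AtomicLong count = new AtomicLong();
    final AtomicLong totalNanos = new AtomicLong();
    volatile boolean enabled = true;

    AdaptiveProbe(long thresholdNanos) { this.thresholdNanos = thresholdNanos; }

    long begin() { return enabled ? System.nanoTime() : 0L; }

    void end(long start) {
        if (!enabled) return;
        long n = count.incrementAndGet();
        long total = totalNanos.addAndGet(System.nanoTime() - start);
        // After a warm-up window, stop metering probes too cheap to matter;
        // their residual cost then shows up in the caller's own reading.
        if (n >= 1_000 && total / n < thresholdNanos) enabled = false;
    }
}
```

The real engine is far more sophisticated (it also hoists measurement up the call chain), but this captures why overhead shrinks as the run progresses: cheap, high-frequency probes stop paying the measurement tax.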


In the remaining part of this article we will simply present the different realities offered up by each Java profiling solution, for now ignoring the fact that some of these tools changed the throughput of the system so drastically that it could be viewed as an entirely different system. Which “reality” do you think more accurately reflects the places in the code that represent the execution hotspots? Which “reality” do you think more accurately reflects the places in the code from which performance optimizations should be driven? These two need not necessarily be the same. If you think you have the right answer to both of these questions, then please create a posting on your own blog and link to this article. We will track the links, and if the analysis seems reasonable and proven we will link back.

Below is the metering table generated after the benchmark run with JXInsight/OpenCore, listing “hotspot” probes still being metered following completion. The “Total” column is the cumulative clock time for a method, including the time spent executing other methods, some of which may or may not be measured. The “Total (I)” column is the cumulative clock time for a method excluding the time spent executing other instrumented methods that are still being metered. If a method is adaptively disabled for measurement purposes, then its subsequent inherent clock time readings are assigned to its immediate callers, unless they too have been dynamically disabled. The package names have been abbreviated in the view.
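A tiny worked example of the two columns, with invented numbers: a caller measured at 100 µs has one callee still being metered at 60 µs and one callee whose probe was adaptively disabled at 15 µs.

```java
// Illustration only: "Total" includes all callee time; "Total (I)" subtracts
// only callees that are still metered, so the time of adaptively disabled
// callees is folded back into the immediate caller's inherent reading.
public class InherentTime {
    static long inherent(long total, long meteredCalleeTotal) {
        // Disabled callees are deliberately NOT subtracted.
        return total - meteredCalleeTotal;
    }

    public static void main(String[] args) {
        long total = 100;         // µs for the caller, callees included
        long meteredCallee = 60;  // a callee whose probe is still enabled
        long disabledCallee = 15; // reassigned to the caller, stays inside Total (I)
        System.out.println("Total=" + total + "µs, Total (I)="
                + inherent(total, meteredCallee)
                + "µs (includes " + disabledCallee + "µs from the disabled callee)");
    }
}
```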


Here are the hotspots identified by the JP profiling solution. You will notice that the invocation count for the SimpleCondition.await method is only 2 million compared to 10 million with JXInsight. That is because we looped for only 4 iterations due to the significant slowdown in the execution of the benchmark. Without the cumulative clock time we found this data incredibly hard to understand and to relate immediately to our results and the execution behavior of the system.

Note: Some of the methods listed with high inherent times don't actually execute as part of the service request processing.


Here are the hotspots identified by the NBP profiling solution. Even though both JP and NBP are non-adaptive and performed relatively similarly, on initial inspection the results look different. As was the case for JP, we had to change the for loop to iterate for only 4 cycles, so you will see 2 million instead of 10 million listed as the frequency against db model related methods. An interesting observation is that nowhere is there a hint of what was actually being performed at the Apache Cassandra model level. The NBP results seem to flag IO as the bottleneck, which is not a revelation, but we miss important insights into which callers are the most relevant in driving this bottleneck behavior – the appropriate call points where we should look to eliminate IO.


We are not going to go over the many reasons why call stack sampling is not terribly effective in understanding and managing the performance of systems in production. We have covered this at great length on this site. But as an exercise we did run the benchmark with call stack sampling. Below, the throughput results are compared with the baseline and with our instrumentation-based solution, which collects far more useful data related to frequency and latency.
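For readers unfamiliar with the technique, here is a minimal sketch of what a call stack sampling profiler does: periodically snapshot every thread's stack and count the top frames. It indicates where threads spend wall-clock time, but, as noted above, it yields no call frequencies or latencies.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Simplified sampling profiler: count top-of-stack frames across all threads.
public class Sampler {
    public static Map<String, Integer> sample(int samples, long intervalMillis)
            throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (ThreadInfo ti : mx.dumpAllThreads(false, false)) {
                StackTraceElement[] st = ti.getStackTrace();
                if (st.length > 0)  // tally the frame each thread is executing
                    counts.merge(st[0].getClassName() + "." + st[0].getMethodName(),
                                 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        sample(20, 10).entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(5)
              .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Note how a method that is called millions of times but returns quickly may never appear in a sample, while a thread parked in a wait loop dominates the counts; this is exactly the frequency/latency blindness referred to above.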


Here are the hotspot methods identified by the JP call stack sampling approach. Which of the realities above do you think most closely resembles this version, ignoring those methods which we know from above don't get executed every time an insert is performed?


In our final test we employed our IMRR technique to generate a revised instrumentation set for our agent. Here is a comparison of the throughput results with the baseline. We managed to get under that magical 5% overhead, even for a well-tuned system with very low latency and high frequency.


Here are the metering results for the revised agent. You can see a slight reduction in some of the totals and averages. The SimpleCondition.await method, whose execution time is a good indicator of system performance, has dropped from 33 microseconds to 29 microseconds.


Again, we would welcome any independent external performance analysis of the results above that would help shed light on which reality is the closest to the actual one, in which the software is not observed and measured. We have a very good idea which one that is, but if you wish you can always check out the sources and then look at the hotspots identified by each of the solutions presented here.

Hardware
2.6 GHz Intel Core i7
8 GB 1600 MHz DDR3
500 GB Flash Storage
Operating System
Apple OS X 10.8.3
Java Runtime
Java HotSpot(TM) 64-Bit Server VM; 1.6.0_43; 20.14-b01-447; mixed mode

Update – 16th April 2013
We have run two new benchmarks with YK, once with its “adaptive” flag set to true and once with it set to false. Here is the statistical summary chart.


Here is the YK hotspot table for the non-adaptive benchmark run. As with the JP and NBP test runs, only 4 cycles instead of 20 were performed because of the significantly longer duration.


Here are the hotspots for the adaptive YK benchmark run. Again, only 4 cycles were performed because of the slowdown. What is interesting is that the adaptive option disabled the SequentialWriter.flushData() method, which was classified by both OpenCore and JP as an actual performance hotspot.