Skip to content

A Performance Measurement of a “canned” Java Persistence (JPA) Benchmark

An article that walks through the typical challenges software performance engineers face in extracting quality information from a software benchmark that delivers insight into the internal behavior of the software, balancing  (collection) coverage and (overhead) cost, accuracy (closeness) and precision (reliability).

There has been a string of blog postings on JPA service provider benchmarking following the announcement of a new vendor, Batoo, and claims of significant code performance efficiency improvements over Hibernate JPA, which has resulted in a rebuttable from the RedHat team here. The latest posting from the EclipseLink team, which inspired this article, claims “15x faster” for an all encompassing vendor benchmark, though I am not sure whether the speed-up reported is after factoring out the database overhead which one could assume to be consistent across vendors for a “canned” benchmark. So I decided to download James Sutherland (Oracle) “canned” benchmark and take it for a much longer spin to see whether I could identify the hotspots in the JPA code which may or not be further tuned (considering Oracle has already invested heavily in tuning the software).

Before I walk through the measurements collected over the course of a day, running various elongated (20 minute) benchmark runs, allow me to list some basic principles and rules of thumb that guide the performance analysis of any software system, which involves instrumentation, measurement and collection of software execution behavior and resource consumption.

  • There are many performance models, some are more useful than others, but comparatively each contributes to some degree of understanding.
  • A measurement system or model can be accurate but not precise, precise but not accurate, neither, or both (a valid system).
  • Measurements of (certain) systems cannot be made without affecting the systems. Awareness and minimization of this effect is critical.
  • If we increase the precision in measuring one observation (event, quantity), we are forced to lose precision in measuring another.
  • Only change one measurement variable at a time and such change should reflect what needs to be asked and answered.
  • There will always be variation, and not all variation is undesirable or inappropriate. Minimizing unwanted variability is paramount.
  • Measure what you can change (or manage). Your understanding of the whole must start from here.
  • Look to validate your measurements using different measurement approaches and techniques. And always rinse and repeat.
  • The location at which a performance bottleneck is removed need not necessarily be the same location at which it occurs.

For the performance analysis I used the latest JXInsight/OpenCore 6.4.EA.5 release with the following jxinsight.aspectj.filters.config file across all runs of the benchmark, which I changed to execute much longer.


I then changed the build.xml file used to execute the benchmark, passing our -agentpath and -javaagent arguments to the benchmark JVM.

-agentpath:/OpenCore/bin/osx-64/libjxinsight.jnilib=prod -javaagent:/OpenCore/bundle/java/default/opencore-ext-aj-javaagent.jar

Below is the first probes metering model I collected following completion of a 20 minute benchmark run. Initial impression? The default hotspot threshold of 50 microseconds is too high for this particular benchmark. Only a few probes are still enabled and labeled as hotspots (orange circle) at the end of the benchmark run. That said there are some probes which can already be looked at because of high frequency and high inherent totals, notably DatabaseAccessor.basicExecuteCall, ReadAllQuery.executeObjectLevelReadQuery and DatabaseAccessor.executeSelect. The last one I suspect calls into the JDBC library.

Note: Because the org.apache.derby package was not instrumented the inherent (self) totals and averages are overstated. But remember we only measure what we can change.

To lower the hotspot threshold down I created a jxinsight.override.config file with the following content.


In dropping the threshold down from 50 to 25 microseconds, I have prevented the disabling of the DatabaseAccessor.executeSelect probe which in turn has altered the inherent total ordering in the table. Without looking at the code I can speculate that DatabaseAccessor.basicExecuteCall calls into DatabaseAccessor.executeSelect. Still not enough hotspots for my likening…can we go lower without perturbing the system too much?

This time I decided to change the inherent threshold used by the default enabled hotspot metering extension – from 5 down to 1 microseconds.


With the drop in the hotspot inherent threshold the number of probes labeled hotspot has grown, with a noticeable change in the average inherent column listing. Both probes ending with executeObjectLevelReadQuery have lost 3 microseconds in their inherent average. This inherent metering has been redistributed to the probes not disabled this time around by the adaptive hotspot metering extension. There has been a relatively small drop in the throughput but not significant enough to stop us lowering the threshold further.

For the next benchmark run I dropped the threshold down to 15 microseconds.


Many more new hotspot labeled probes, in particular ObjectBuilder.buildObject. Both executeObjectLevelReadQuery ending probes have lost a significant portion of their inherent total with both of their inherent averages less than 1. There has been a markable drop in the throughput, which we will come back to later.

Dropping the threshold to 10 adds a new hotspot labeled probe to the metering model, DatabaseAccessor.processResultSet, which has an inherent average of 9 – the same amount that the basicExecuteCall probe has lost in its inherent average. The throughput decline seems to have leveled off.

Dropping the threshold down to 5 introduces two new hotspot labeled probes – DatabaseAccessor.prepareStatement and ObjectBuilder.buildAttributesIntoObject. But throughput has degraded further and both average and inherent average show increases across the board, indicating we have reached a (lower) limit.

Up to now, every time a benchmark was run the hotspot metering extension had to learn all over again, disabling much of the instrumentation applied to the complete org.eclipse.persistence code base. We can eliminate the impact this learning has on the overhead and accuracy of the metering model by generating (offline) a new agent bundle from a metering model.

java -jar opencore-console.jar -generate-aj-bundle hotspot-eclipselink-t5-ti1-i10000-probes.ocs /OpenCore/bundle/java/default/opencore-ext-aj-javaagent.jar

This command  creates a new opencore-ext-aj-javaagent.jar which only instruments those methods with a hotspot label. All that is needed is to change the build.xml file to point to this new agent bundle. Below is the metering model exported following a re-run of our last config. The throughput has been restored and the averages and inherent averages reduced.

We can repeat this instrument-measure-refine-repeat (IMRR) process again until there is no significant change in the metering model.

java -jar opencore-console.jar -generate-aj-bundle hotspot-eclipselink-t5-ti1-i10000-imrr-1-probes.ocs /OpenCore/bundle/java/default/opencore-ext-aj-javaagent.jar

We can assess the accuracy and reliability (repeatability) of our measurements by taking the last bundle generated, disabling the hotspot metering extension and enabling the spotcheck metering extension. The spotcheck metering extension, which enhances all probes fired by instrumented methods, randomly elects to meter a firing probe and then temporarily disables the metering of all direct or indirect called methods (enclosed firing probes). With spotcheck you can never have the metering of a probe enclosed within the metering of another probe (caller method), which means the total and inherent totals will always be equal.


I’ve filtered the table view listing the results of the random sampling by “executeCall“. Comparably they are very much the same results, especially considering this is an average of a random sample and not comprehensive statistical distribution, which we could have collected using the quantization metering extension.

The approach taken up to now has been to identify the most appropriate location in the code from which to start investigations into possible code tunings. As stated above this location might not necessarily be the location of the bottleneck itself. It it very common to find that the closer you get to a performance bottleneck in measuring it the greater the pushback or diversion you incur.

To help identify possible high frequency bottlenecks that can’t be so easily timed we can use the swatch metering extension which offers an alternative approach to measurement based largely on sampling of a single memory slot called a swatch which each probe tries to enter and then stay inside until finished executing, whilst still allowing child (nested) probes to temporarily take ownership of the swatch. In doing so we can validate our previous findings.


Here is the resulting metrics model, filtered on swatch.samples, following completion of a benchmark run in which the swatch metering extension was enabled. Many of the top listed metrics, ${probe}.swatch.samples, look very familiar to our hotspots above, though swatch tends to highlight the bottleneck points rather than the possible code change points.

The swatch metering extension is also adaptive. It can be controlled by the threshold property. Here is the metrics model with threshold set to 10, which is not microseconds but the number of swatch sample fast spins.

As the threshold is increased, probes are dynamically disabled, resulting in other probes having less contention on the swatch and so more samples. Here is threshold=100.

The final metrics model with a threshold set to 1000. Notice the visual pattern that has emerged from the delta metric samples as we increased the threshold and disabled more probes.

Further Reading

Increasing the Ability of an Experiment to Measure an Effect

Appendix A – Code Listings