Performance Measurement Truths – Rarely Pure and Never Simple

In Reality, Reactivity, Relevance and Repeatability in Java Application Profiling we looked at the multiple realities proposed by different performance profiling solutions and performance measurement techniques. Hopefully that article showed how difficult it can be for a novice to discern which of the measurement models most closely resembles the truth, though the degree of measurement overhead is a good starting point. Naturally we think that some approaches to measurement, such as ones based on intelligent adaptive technologies, are far better, but even then we must be careful in how we execute the benchmark and how we configure the adaptive measurement, as these can produce multiple realities even with the very same measurement solution. In this article we look at the impact of different adaptive measurement thresholds, as well as the impact that a slow or fast ramp-up of concurrent workload has on a benchmark and its performance model.

The Apache Cassandra project announced its latest release, version 1.2.5, just this week, and we have opted to use it for the benchmark so that our findings can be put to good use by the NoSQL community behind the project. Unlike the previous benchmark, the client stress load driver will be run on a separate machine, and the client insert thread count is increased from 1 up to 20.

Here is the first script used.

for i in {1..20}
do
  ./tools/bin/cassandra-stress -d $DSC_HOST -o insert -t $i -n 1000000
done

By default the JXInsight/OpenCore adaptive measurement (metering) engine uses a hotspot extension based on the concept of a scorecard, which determines whether the engine continues to measure a particular probe (method) by way of a credit balance that is adjusted based on timing threshold comparisons. The default threshold values are set with the typical web/enterprise application in mind, so the first thing that needs to be done is to drop the thresholds (in microseconds) from 100 and 5 down to 10 and 1 respectively.
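As a sketch, the lowered set points might be expressed using the abbreviated property names this article itself uses. Only j.s.p.hotspot.threshold appears verbatim later in the article; the inherent-threshold property name below is our assumption, modeled on it:

```
# Hypothetical override settings; only j.s.p.hotspot.threshold is named
# in this article, and the inherent-threshold property name is assumed.
j.s.p.hotspot.threshold=10
j.s.p.hotspot.threshold.inherent=1
```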


Just like in Reality, Reactivity, Relevance and Repeatability in Java Application Profiling, we only instrumented the following packages, though in the results presented below we have filtered the view to include only those probes in org.apache.cassandra. The reason is that including additional packages in the instrumentation gives us more accurate inherent (self) totals, but when it comes to what can actually be changed the focus should be on the project's own code base.


Below is a relatively short listing of the hotspots found with the measurement thresholds set to 10 (inclusive) and 1 (exclusive). From this listing we get an initial outline of the main execution points in the processing of a request from the client stress tool. Even without a call tree, by looking at the Count, Total and Total (Inherent) column values it is easy to form an educated guess that RowMutation.apply (directly or indirectly) calls Table.apply, which in turn calls ColumnFamilyStore.apply. Don’t worry if this is not immediately obvious; it is not of great importance in the rest of this article, though understanding the difference between inclusive and exclusive measurement time in a caller-to-callee chain will help when we see how the inherent total (and average) shifts as thresholds are changed.
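The bookkeeping behind inclusive versus inherent (self) totals can be sketched as follows. This is a generic illustration of the arithmetic only, not OpenCore's actual implementation, and the timings are invented:

```java
// Sketch (not OpenCore internals): how inherent (self) totals relate to
// inclusive totals in a caller-to-callee chain. All timings are hypothetical.
public class InherentTotals {
    // Inherent (exclusive) total = inclusive total minus the inclusive
    // totals of all directly measured callees.
    static long inherent(long inclusiveTotal, long measuredCalleesTotal) {
        return inclusiveTotal - measuredCalleesTotal;
    }

    public static void main(String[] args) {
        long rowMutation = 100; // RowMutation.apply inclusive total (us)
        long table = 80;        // Table.apply inclusive total (us)
        long cfs = 60;          // ColumnFamilyStore.apply inclusive total (us)

        System.out.println("RowMutation.apply inherent: " + inherent(rowMutation, table)); // 20
        System.out.println("Table.apply inherent: " + inherent(table, cfs));               // 20
        // If ColumnFamilyStore.apply were dynamically disabled by the engine,
        // its 60us would no longer be subtracted and Table.apply's inherent
        // total would appear as the full 80us - the shift discussed below.
    }
}
```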


What if we reduce the threshold? What impact will this have on the reality proposed by the measurement engine? First of all we must be careful not to introduce significant measurement overhead that radically alters the execution dynamics of the software, especially in a highly concurrent environment. There is no magic (though self-adaptation can at times appear so) that eliminates measurement overhead entirely; that only ever happens in the sales & marketing literature of vendors who think that everyone runs PetStore-style apps in production, or who have never tuned a SQL statement using standard database tooling. More important, at least in the context of this article, is that any change in thresholds can result in some probes that were previously dynamically disabled remaining measured until the very end of the benchmark, and in doing so reduce the inherent totals and inherent averages of their immediate callers, which previously absorbed such timings into their own totals. This reduction of the caller's inherent total means that some probes will drop down the sorted table listing, and in some cases the callers themselves will be disabled, though setting the inherent threshold to 1 mitigates this somewhat.

Below is a revised listing with the property j.s.p.hotspot.threshold set to 5 microseconds. A number of new probes appear, still enabled and at the top of the metering table following completion of the benchmark, which performed 20 iterations of the scripted loop shown above. The most notable new probes include CassandraServer.createMutationList, PrecompactedRow.merge, SSTableWriter.append and CommitLogSegment.write. Notice how StorageProxy.mutate and CassandraServer.batch_mutate have dropped down the table listing, because measurement now continues far longer at lower levels in the caller-to-callee chain.

Note: The number of probes measured with the hotspot threshold set to 10 microseconds was 470 million. With the threshold set to 5, this number increased to 782 million.


Dropping the threshold down to 2 microseconds, which is what we currently regard as the borderline before needing to switch to something entirely different such as Signals, offers up more potential hotspots such as AtomicSortedColumn.addAllWithSizeDelta, MergeIterator$ManyToOne.computeNext and OnDiskAtom$Serializer.deserializeFromSSTable, as well as LatencyMetrics.addMicro. The last one is interesting in that it is instrumentation built directly into Apache Cassandra since version 1.2, delegating call execution to the open source Metrics library covered in our article series titled “How (not) to design a Metrics API”.
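In spirit, such built-in latency instrumentation amounts to something like the following simplified sketch. This is a hypothetical class, not Cassandra's or the Metrics library's actual code; the point is that the update itself has a cost that any profiler layered above it will observe:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified latency tracker in the spirit of Cassandra's
// LatencyMetrics.addMicro (not the actual implementation).
public class LatencyTracker {
    private final AtomicLong opCount = new AtomicLong();
    private final AtomicLong totalLatencyMicros = new AtomicLong();

    // Called on every operation: two atomic updates per call, a cost that
    // is itself visible to any measurement layered above this code.
    public void addMicro(long micros) {
        opCount.incrementAndGet();
        totalLatencyMicros.addAndGet(micros);
    }

    public double meanMicros() {
        long n = opCount.get();
        return n == 0 ? 0.0 : (double) totalLatencyMicros.get() / n;
    }
}
```

A profiler probe wrapped around such a method measures not only the application work but also this bookkeeping, which is why built-in instrumentation is never free.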

Note: With the threshold set to 2, the number of probes measured (metered) increased to 1,320 million.

Comparing this model with the previous one, we not only have additional probes listed, we also have a number of probes that have moved down significantly in the table ranking, such as PrecompactedRow.merge and CommitLogSegment.write, which previously had near-equivalent Total and Total (Inherent) column values.


All of the above models reflect the truth to some degree. Many of the items listed appear in all three hotspot probe sets. What we have traded in setting different thresholds is coverage, cost (accuracy) and comprehension. Each of these changes offers a different set of starting points for performance tuning in the caller-to-callee chain.

Measurement will always contaminate (or distort) the underlying software execution to some degree. This also applies to instrumentation already present within the application code base (a reason why some less informed individuals falsely claim that DTrace has little or no overhead). The skill is in managing this overhead via runtime intelligence and measurement control set points. The art is in knowing which information quality attributes to trade, and at what point in the performance investigation and ongoing monitoring of an application and its services. It is natural (though not always productive) to want to collect more information in any performance assessment, but in trying to get close to such hotspots we must be careful not to get pulled into a reality distortion field. It can be far more efficient, both online and offline, simply to identify suitable and safe landing points within the code base from which to explore the remaining, deeper terrain, especially within a static plane such as a code editor. Again, success depends on getting close enough, but not too close.


Up to now we’ve looked at how performance models can be affected by the setting of different measurement control set points (thresholds) within the adaptive measurement engine. What about the nature of the benchmark load itself? Does the slow, incremental ramp-up of load from 1 thread to 20 threads, with each scripted loop iteration, create a different performance model than starting at, and remaining at, 20 threads throughout the benchmark?

for i in {1..20}
do
  ./tools/bin/cassandra-stress -d $DSC_HOST -o insert -t 20 -n 1000000
done

Below is the performance model derived from execution of the altered benchmark script, with the threshold and inherent threshold set points reverted to 10 and 1 respectively. It is markedly different from the very first hotspot probe listing presented above, which used the same threshold set points. In fact it contains some of the additional hotspots uncovered when the threshold was lowered from 10 to 5 and then to 2 microseconds.


Why would the performance model be so different depending on whether we slowly ramp up concurrency or go directly to the highest level defined for the test? There are a number of possible causes, but the primary one is the adaptive measurement engine itself. With a slow ramp-up of load, many of the probes listed as hotspots would have been measured under optimal (if not unrealistic) workload conditions, with no contention for resources such as the processor, object monitors and the memory system. The measured timings would have been understated, and the adaptive measurement engine would have disabled many of the probes which, in time and under more realistic production workloads, represent significant contributors to response timings and throughput. With high load very early in the benchmark we quickly tax the memory and processor systems, and thus probes with high allocation or memory/CPU costs rise to the top much more quickly.
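This disabling behavior can be illustrated with a toy credit-balance scorecard. The following is our own sketch of the general idea described earlier in this article, with invented credit values; it is not OpenCore's actual algorithm:

```java
// Toy credit-balance scorecard (an illustration of the general idea only;
// the starting credit and adjustment amounts are invented).
public class Scorecard {
    private long credit = 100;          // hypothetical starting balance
    private final long thresholdMicros; // e.g. the 10us inclusive threshold

    public Scorecard(long thresholdMicros) {
        this.thresholdMicros = thresholdMicros;
    }

    // Called after each measured firing of a probe: timings at or above the
    // threshold earn credit, timings below it cost credit.
    public boolean record(long elapsedMicros) {
        credit += (elapsedMicros >= thresholdMicros) ? 1 : -1;
        return credit > 0; // false => the probe is dynamically disabled
    }
}
```

Under an uncontended slow ramp-up, a probe's early timings fall below the threshold, the balance drains, and the probe is disabled before contention ever pushes its timings up; under immediate full load the same probe stays above the threshold and remains measured.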

Lowering the threshold from 10 to 5 reveals yet another reality in our performance model.


Even with the most advanced measurement solution in the world, performance tuning is at best a complex challenge that cannot be easily simplified. Unless you are the one who created the project (and the problem), it can be extremely difficult to reason about the software execution behavior of distributed systems when you know that it is impractical, if not impossible, to obtain the exact information you need both to identify and to resolve the problem (which can be two very different tasks), each and every time.

Whilst we strongly believe that technologies such as Signals, QoS for Apps and Adaptive Control can actively manage the performance and improve the resilience of distributed systems with very little human intervention, we don’t believe the same applies to the performance analysis of such systems. And yes, this runs counter to nearly everything promoted by other vendors in the application performance management space, but then again it is not as if these vendors are trying to sell you the truth.

Commentary: Since the publication of our benchmark challenge not a single application performance monitoring vendor has come forward with any sort of attempt at beating or matching our performance analysis results and low overhead.

Commentary: It is because of these multiple sensed realities (reality is itself a computation performed within our own minds) that we hold little regard for other application performance monitoring solutions that claim to “put out fires in production within 15 minutes”. Unless you are dealing with extremely low-hanging fruit, day in and day out (which begs another, far more serious and troubling question), such statements represent sheer marketing nonsense and completely discredit the work of professionals and scientists in the performance monitoring and management field who are actively (re)searching for new ways to solve the hard task of monitoring and managing the ever-increasing complexity of production systems, especially in the cloud.

Appendix A – (Some) Hotspot Code Listings