Skip to content

Case Study: Scala Compiler – Part 5

This is the fifth part in a series of articles on the “how to of software performance analysis” using JXInsight/OpenCore. Please read the first, second, third and fourth parts as well as the preamble before proceeding if you have not already done so.

In this part of our series investigating the performance of the Scala compiler in compiling its own source files we switch attention to eliminating as much noise as possible in the metering model and in the course reducing measurement error which is the difference between a measured value of quantity and its true value.

How do you know when you have a significant amount of measurement error? In some cases it is pretty easy. For example running the performance test referenced throughout this series takes on overage 110 seconds. Running the same performance test with Oracle’s HPROF, with cpu=times, took 1,041 minutes (17 hours 21 minutes). The lowest performance test run with JXInsight/OpenCore and its hotspot strategy took 121 seconds and a large part of that was the actual startup and shutdown costs incurred during the dynamic instrumentation of the class bytecode and the exporting of the metering model as a snapshot on exit. Overhead wise we have Oracle’s HPROF with 1,039 minutes compared with JXInsight/OpenCore’s <11 seconds. Its for this reason why many are forced to use cpu=samples even though in such cases the resulting data is pretty much useless and riddled with false leads. You get the data faster but the wrong data which admittedly beats waiting a day to get the wrong but different data.

Pretty much every performance profiling solution we have tested has the same problem because fundamentally they are dumb measurement machines not taking into account cost, relevance and value in instrumenting, measuring and collecting – no self awareness or adaption(s). Without some sort of intelligence the measurement solution very quickly becomes part of the performance problem itself and worse exacerbating the problem so significantly that the only finding at the end of the test is the performance measurement solution has a performance problem itself.

Part of the problem here is that developers mistakenly believe that a performance measurement solution should direct them to the actual line of code that is the problem when generally the best starting point for both observation and tuning is just above the problem in the caller-callee chain providing an appropriate enclosing context to the execution. Below is a graphic we use with new customers showing them the target points we are aiming for in our metering of their applications in development, test and production.

Sampling solutions tend to focus on call points in the ambiguous and ineffective observation space, primitive event/trace instrumentation based solutions aim for the high frequency bottleneck point but get deflected elsewhere completely distorting all the data, whilst strategy based measurement solutions like our intelligent activity metering technology focus on finding safe and optimal observation and tuning points respectively.

JXInsight/OpenCore makes it incredibly easy to reduce noise and eliminate (more like manage) as much as possible measurement error via extensions that dynamically decide at the point of firing to meter or not the (source) probe, and over the iterations of the IMRR loop eliminating the instrumentation of the probe altogether. In the rest of this article we will look at the metering model derived from using three such extensions that have been selected with the previously observed deep call stacks in mind: tracedepth, nonrecursive and tracecycle. We are not going to use them in isolation as the hotspot strategy will remain active (which is the the default setting). These extensions will simply further refine the metering model during the course of the test execution.

Here is the configuration required to install the tracedepth with the maximum depth of the probe metering stack from caller to callee set to 50.

[jxinsight.override.config]
jxinsight.server.probes.tracedepth.enabled=true
jxinsight.server.probes.tracedepth.upper=50
jxinsight.server.probes.strategy.hotspot.threshold=25
jxinsight.server.probes.strategy.hotspot.warmup=100
jxinsight.server.probes.strategy.hotspot.statistic=avg
jxinsight.server.probes.console.enabled=true
jxinsight.server.probes.console.dump.file.name=probes.ocs

Below is the metering model exported on completion of the very last iteration of the IMRR script. Compared with the metering model in part 4 there is an overall metering count reduction of more than 5 million though the metering totals are relatively similar except for scala.tools.nsc.symtab which has lost more than 50% of its previous total indicating code in this package are executed deeper in the stacked execution. The overall metering total has dropped by 8 seconds.

The top probe listing ordered by inherent total is also very similar to what was reported in part 4. We got roughly the same results for a cheaper price (in terms of measurement cost) – always a good thing if the balance is kept right between cost, relevance and value. That does not mean we should disregard the differences that do exist between the models as they reveal the stack execution nature of the probes in question.

Metering extensions such as tracedepth are extremely useful because they allow the software performance engineer practically near complete control over the measurement cost and with the minimal of surface configuration. An engineer can simply change some setting to suit the situation and environment in terms of coverage, collection and acceptable cost.

Here is a metering model collected with the above configuration but with the tracedepth upper set to 100. The overall metering count and totals have risen as expected. The scala.tools.nsc.symtab package has returned close to its previous metering totals but not altogether hinting at deeper call depths in the normal execution of the software. The scala.tools.nsc.matching package has now reappeared after disappearing in the upper=50 performance test run. Whats also worth noting is that the scala.tools.typechecker package has slightly increased its metering total but its metering count has grown by over 55% whilst its inherent metering total has remained pretty much the same. This clearly points out the recursive or re-entrant call nature within this package.

Here again are the top probes ordered by inherent total with the maximum metered call depth of 100. If you look at the bottom of the table you will see that even after 20 iterations of the IMRR loop some probes were still being disabled in this case scala.tools.nsc.transform.Erasure$Eraser.typed1.

Here is a metrics model generated by comparing the two tracedepth metering model filtered by clock.time.inherent. Can you notice the swings in the assignment of inherent totals between caller and callee?

Lets try another strategy this time eliminating the metering of same named caller-to-callee calls using the nonrecursive metering extension (there is also a nonreentrant version).

[jxinsight.override.config]
jxinsight.server.probes.nonrecursive.enabled=true
jxinsight.server.probes.strategy.hotspot.threshold=25
jxinsight.server.probes.strategy.hotspot.warmup=100
jxinsight.server.probes.strategy.hotspot.statistic=avg
jxinsight.server.probes.console.enabled=true
jxinsight.server.probes.console.dump.file.name=probes.ocs

In the graphic below the call to x.y.z from x.y.z is still executed at runtime but not measured by the metering engine.

Here is the metering model collected with the above configuration. Whilst the metering total looks very much like the metering models in part 4 it is the count and inherent total values that have changed somewhat with the biggest change occurring in the scala.tools.nsc.ast and scala.tools.nsc.transform packages. The increase and decrease in inherent totals is more than likely caused by the change in the value of the average statistic used by the hotspot strategy in deciding whether to continue to meter a particular firing probe. By reducing the count of the nested recursive calls the average increases possibly beyond the threshold used by hotspot.

Below is a listing of the top probes ordered by inherent total. Comparing this to the original hotspot metering model the transformStats probe has increased its inherent total by more than 5 seconds and the typed1 probe has decreased by 4 seconds whilst the metering totals have remained the approximately the same.

Here is a metrics model generated by comparing the nonrecursive metering model with the original hotspot metering model filtered by clock.time.inherent. The probes listed on rows 3-5 did not appear in the original metering model more than likely because of the overstatement of the count due to recursion (or method overloading as signatures are not used here) causing the average to be understated which resulted in the probe being disabled during the IMRR process execution.

Generally when a repeated call sequence is detected on the call stack we don’t want to meter any further down in the call execution until the sequence is popped off the stack. This is what the tracecycle metering extension provides.

The following configuration enables the tracecycle metering extension specifying the minimum cyclic call sequence length to be 3.

[jxinsight.override.config]
jxinsight.server.probes.tracecycle.enabled=true
jxinsight.server.probes.tracecycle.length=3
jxinsight.server.probes.strategy.hotspot.threshold=25
jxinsight.server.probes.strategy.hotspot.warmup=100
jxinsight.server.probes.strategy.hotspot.statistic=avg
jxinsight.server.probes.console.enabled=true
jxinsight.server.probes.console.dump.file.name=probes.ocs

With this configuration we have eliminated more than 10 million calls from the metering model presented in part 4 and we have recorded the best benchmark time up to now. What’s truly amazing is that we have gone from metering 16.5 billion possible calls to just under 1.5 million and we never had to decide (or click) what should and should not be measured. And to top it off we have a metering model that at the package level looks relatively similar to our previous models in terms of totals and inherent totals though the counts have changed significantly with a reduction in the number of metered probes in the scala.tools.nsc.typechecker package from just over 7 million to a mere 55 thousand. The metering count for the scala.tools.nsc.ast package has also dropped significantly down from 2.5 million to 205 thousand.

From the following top probes listing ordered by inherent total its pretty easy to identify the probes that represent the last call point in a repeating call sequence. Of course what the inherent total means here and what it means in other models collected up to now is different. But that does not mean it is wrong. We like to say “There are good performance models, poor (bad) performance models and then there is call stack sampling”. The differences in the models are just as revealing as the similarities and largely reflect movement of inherent cost from one probe to another in a caller-callee chain. The art in software performance measurement is how to alter this assignment of cost in such a way to better observe the core behavioral patterns (performance drivers) without perturbing the system and to identify the appropriate and most effective starting point for further more specific performance investigation and tuning. But unlike other straw man approaches such as call stack sampling the differences in the models for each performance test can be reasoned on and explained.

Again as a software performance engineer we can control the degree of measurement by simply changing one or more properties without having to get into the details of the code base. Here is the metering model collected with the minimum cycle call sequence length set to 5 for the tracecycle metering extension.

Here is a revised listing ordered by inherent total.

In this part we focused on filtering out (or compressing) particular call execution behaviors. Deep call stacks. Recursive calls. Cyclic stacked call sequences. In the next part we are still going to look at patterns on the execution call stack but not in terms of the probes themselves but in terms of their consumption of cost which in the case of performance profiling is latency (wall clock time).