
Is JVM Call Stack Sampling Suitable For Monitoring Low Latency Trading Apps?

I was surprised this week to see a posting questioning the performance results of a call stack sampling profiler used to monitor a low latency trading application. The poster was clearly confused by the results, but not as much as I was confused as to why anyone would consider using such an approach in a low latency environment in the first place. He evidently did not fully understand what is involved in monitoring performance and how best to apply the various performance measurement techniques. When asked what led him to believe that sampling was suitable, his response was that he had been told to ensure little or no impact on latency and throughput. He naturally, but incorrectly, assumed sampling was appropriate. When he did finally get to look at the performance data collected, he was deeply disappointed that it offered very little useful information.

“It is telling me what I already know…we have 20 threads executing the same call path repeatedly and concurrently…what I need to know is the frequency of particular method/path executions and the latency ranges of their execution over time…this [sampling data] is redundant…useless”.

Did you know that most profilers on the market use Java/JVM APIs which cause all threads to block during the collection of thread stack frames? This collection is not done asynchronously and, worse, within the Hotspot JVM it requires all threads to wait for every other executing (running) thread to reach a safepoint. You can think of this process as a form of musical chairs in which there are no winners: the music stops, we must then wait until all those dancing get seated (counted and collected), and only then can the next musical sequence begin.
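
To make the mechanism concrete, here is a minimal sketch of the polling loop that JMX based samplers essentially reduce to, whether the ThreadMXBean is used in-process or through a remote JMX connection. This is illustrative only and not the code of any of the profilers tested; the 20ms interval simply mirrors the interval used in the tests below.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class NaiveStackSampler implements Runnable {

    private static final long INTERVAL_MS = 20;

    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Each dump is serviced by a stop-the-world ThreadDump VM operation:
            // every application thread must first reach a safepoint before the
            // stacks are walked, the musical chairs described above.
            ThreadInfo[] snapshot = threads.dumpAllThreads(false, false);
            aggregate(snapshot);
            try {
                Thread.sleep(INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void aggregate(ThreadInfo[] snapshot) {
        // a real sampler would bucket the sampled call paths here
    }
}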

“A [safe]point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. Both compiled Java code and C/C++ code [may] be optimized between safepoints, but less so across safepoints. The JIT compiler emits a GC map at each safepoint. C/C++ code in the VM uses stylized macro-based conventions (e.g., TRAPS) to mark potential safepoints.” - Hotspot Glossary

I’ve written before about how ineffective call stack sampling is (in general) for the performance analysis of applications beyond a single thread of execution, unless it is augmented with some form of data collection that hints at frequency (invocation) and latency (duration), which incidentally is what we addressed in our innovative swatch metering extension, an asynchronous (non-blocking) hybrid approach. In benchmarks we have shown how bad its impact can be once the number of sampled threads exceeds 50, which is typical of web applications. But what about the impact when the number of concurrent threads is relatively small? Can application performance management tools based largely on call stack sampling be used in low latency environments, leaving aside the fact that they don’t actually provide any useful information?

Note: There are three standard performance measurements collected for a managed (monitored) resource or service: utilization (busy), throughput (frequency) and latency (delay). Call stack sampling is focused on the busyness of the application code base, though it can’t always accurately distinguish between suspension and spinning in the execution of a thread.

To determine the degree of impact call stack sampling can have on latency I wrote a small test (code at the end of the article) which executes a shallow call path (depth < 45) across a specified number of threads. I then instrumented just the entry point (transaction) method to collect a (power of two) quantization of the clock.time in microseconds. Below is the baseline with 5 threads and no instrumentation or call stack sampling of the execution path. The largest band is 64-128 microseconds, with an average within this band of 78 microseconds.
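
As an aside, here is a minimal sketch of what such a power-of-two quantization might look like; the class and method names are my own illustration and not the actual JXInsight/OpenCore probe. Band n counts calls whose latency falls in [2^n, 2^(n+1)) microseconds, and the average reported for a band is simply its total divided by its count.

import java.util.concurrent.atomic.AtomicLongArray;

public final class LatencyQuantizer {

    // band n counts calls whose latency (in microseconds) falls in [2^n, 2^(n+1))
    static final int BUCKETS = 24;

    private final AtomicLongArray counts = new AtomicLongArray(BUCKETS);
    private final AtomicLongArray totals = new AtomicLongArray(BUCKETS);

    public void record(long micros) {
        int band = (micros <= 0) ? 0
                 : Math.min(63 - Long.numberOfLeadingZeros(micros), BUCKETS - 1);
        counts.incrementAndGet(band);   // frequency within the band
        totals.addAndGet(band, micros); // used to derive the band average
    }
}

The recording path is kept allocation free (plain atomic arrays) so that the measurement itself does not disturb the latency being measured.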

Here is the same test, this time profiled using a remote call stack sampling solution that uses JMX to collect stack traces (interval=20ms). Strangely this test had a higher overall call frequency and a greater number of calls in a lower latency band (32-64 microseconds), whilst also causing far more garbage collection cycles. More importantly, there was a noticeable increase in frequency in the higher latency bands. In the baseline test there were 9,920 calls in the 128-256 band but here there are 38,983. There are more outliers and they are much further afield, a no-no in low latency environments.

I then ran the test again, this time using another call stack sampling profiler that uses the JVMTI C API and incurs little or no garbage collection (interval=10ms). The overall throughput dropped slightly but, more importantly, there was a bigger increase in the frequency of the higher (outlier) latency bands. There were more than 58,000 calls above 128 microseconds.

Note: Both samplers were attached to the JVM after the benchmark had started executing, giving the benchmark a 5-10 second grace period, which is one of the advantages of sampling if you don’t need to monitor at all times (though that is rarely the case today).

The final test was instrumented with JXInsight/OpenCore. Whilst the average in the 64-128 band was slightly higher at 81 microseconds, compared to the baseline of 78, the frequency in the other bands was much closer to the baseline than with the two other call stack sampling solutions.

Remember that JXInsight/OpenCore is an intelligent and adaptive performance monitoring and management solution, so it will incur some additional cost during the initial warmup period as it figures out what it should continue to meter. It must also collect and aggregate the metering clock.time counts and totals, which none of the call stack sampling solutions offer.

I then re-ran the test, this time with a thread count of 20. Here is the quantization for the baseline test with no instrumentation (other than the test method) and no call stack sampling. The frequency is still largely clustered in the 64-128 microsecond range. The 128-256 latency band has seen a significant increase, but from there on it drops off.

With the JMX based call stack sampling solution, far more of the calls have been pushed up into much higher latency bands than previously, following the increase in the number of threads. There are now more than 120,000 calls above 256 microseconds, compared to a little more than 500 in the baseline test.

Running the test with the JVMTI C API call stack sampling solution, there are now more than 200,000 calls greater than 256 microseconds. Relatively this is still a low percentage, but in all likelihood it is not acceptable in production, especially for what little value it delivers, since it does not allow us to control execution or to ascertain whether a transaction (trade) in progress is hung.

JXInsight/OpenCore does remarkably well, even with the increase in the number of threads. There are approximately 7,500 calls above 256 microseconds. It performs a million more transactions than the baseline, and at a lower overall average. Adaptive instrumentation and measurement scales with an increasing number of threads and has a known cost that is attributed to actual execution work, whereas call stack sampling does not.

As stated above, I expected most of the outliers in the JXInsight/OpenCore test runs to be attributed to the adaptive phase of the metering. To show this I ran a second test with 20 threads and performed a mark at the halfway point in the execution. The delta table columns below list only those calls that occurred after the mark.
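
Conceptually the mark and delta are just a snapshot and a subtraction over the band counters. Building on the LatencyQuantizer sketch above (the method names here are my own assumptions, not the actual product API):

    // additions to the LatencyQuantizer sketch: mark() snapshots the per-band
    // counters and deltaCount() reports only the calls recorded after the mark
    private final long[] marked = new long[BUCKETS];

    public void mark() {
        for (int band = 0; band < BUCKETS; band++) {
            marked[band] = counts.get(band);
        }
    }

    public long deltaCount(int band) {
        return counts.get(band) - marked[band];
    }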

Update: 13th Dec 2012 @ 12:15 pm

The initial test code used LockSupport.parkNanos(), which has presented some problems for those trying to repeat the tests on MS Windows. It also suffers from spurious wake-ups, which appear to be in some way influenced by the stack sampling itself. So I’ve done another set of benchmark test runs with the following spin implementation of the term() method.

lowlatencytrading.sampling.code.spin
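
In essence the spin variant replaces the park with a busy loop on System.nanoTime(); a minimal sketch, assuming term() takes the pause interval in nanoseconds (the original listing may differ):

    private static void term(long pauseNanos) {
        final long deadline = System.nanoTime() + pauseNanos;
        while (System.nanoTime() < deadline) {
            // busy spin on the clock rather than parking, so there are no
            // spurious wake-ups and no dependence on OS timer granularity
        }
    }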

Here are the baseline results with the thread count set to 5. The overall average is 50 microseconds. Much better.

lowlatencytrading.sampling.p1.5.spin

Below are the results for the remote JMX based call stack sampler. There are close to 30,000 calls above 128 microseconds, compared to a total of 129 in the baseline.

lowlatencytrading.sampling.p2.5.spin

The native JVMTI agent based sampler had more than 50,000 calls above 128 microseconds, a much higher overall total than the JMX call stack sampler, though the latter had far more in the higher latency bands, which we can probably attribute to GC.

lowlatencytrading.sampling.p3.5.spin

JXInsight/OpenCore fares far better, with fewer than 1,000 calls above 128 microseconds, compared to 30,000 and 50,000 for the call stack samplers.

lowlatencytrading.sampling.p4.5.spin

I re-ran the tests again, this time with 10 threads, which maxed out my CPU capacity. The baseline now has 15,052 calls above 128 microseconds.

lowlatencytrading.sampling.p1.10.spin

The JMX call stack sampler had over 111,000 calls above 128 microseconds, with a significant number in the 2,048-4,096 microsecond latency band.

lowlatencytrading.sampling.p2.10.spin

The native JVMTI call stack sampling solution had ever so slightly more, at just over 115,000 calls above 128 microseconds.

lowlatencytrading.sampling.p3.10.spin

Below are the results for JXInsight/OpenCore. Once again its adaptive instrumentation and measurement approach fared far better with 16,768 calls above 128 microseconds compared to a baseline of 15,052.

lowlatencytrading.sampling.p4.10.spin

Update: 13th Dec 2012 @ 6:15 pm

Another question asked was how much the call stack depth influences the outliers in the case of the samplers. Below are the baseline results for a stack depth of 80 with a thread count of 5. There is a slight increase in the 64-128 microsecond latency band, from 4,434 to 6,180.

lowlatencytrading.sampling.p1.5.s2

Here are the baseline results for a stack depth of 120. A much bigger jump this time, but not much has changed in the outliers.

lowlatencytrading.sampling.p1.5.s3

From the JMX call stack sampler results below it would appear that many of the calls previously in the 128-256 microsecond latency band jumped up into the 256-512 band when the stack depth was increased to 80.

lowlatencytrading.sampling.p2.5.s2

The build-up in frequency and average latency in the higher bands continues with the increase of the stack depth to 120.

lowlatencytrading.sampling.p2.5.s3

With the increased stack depth the native JVMTI agent call stack sampler also experienced an increase in outlier frequency. Here are the results at a call stack depth of 80.

lowlatencytrading.sampling.p3.5.s2

At a stack depth of 120 the 256-512 latency band continues to increase in frequency, with the averages in all latency bands increasing as well.

lowlatencytrading.sampling.p3.5.s3

As probably expected, JXInsight/OpenCore did not suffer noticeably with the increase in stack depth, even though it did need to do more measurement work during the early stages of the instrumentation and measurement adaptation. Here are the results with a stack depth of 80.

lowlatencytrading.sampling.p4.5.s2

Finally, here are the results at a stack depth of 120. Pretty good when compared to the changes experienced by the call stack samplers, and in light of the increases in the baseline results.

lowlatencytrading.sampling.p4.5.s3

Appendix A – Code Listing
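
What follows is a rough reconstruction of the harness described above rather than the original listing: a configurable number of threads repeatedly execute a shallow recursive call path, the entry point (transaction) is timed into the power-of-two bands of the LatencyQuantizer sketch, and the spin term() from the December update sits at the bottom of the call chain. All class, method and constant names are my own assumptions; in the actual tests the quantization was collected by instrumenting the entry point rather than timing it inline.

import java.util.concurrent.TimeUnit;

public class SamplingBenchmark {

    private static final int  THREADS     = Integer.getInteger("test.threads", 5);
    private static final int  CALL_DEPTH  = Integer.getInteger("test.depth", 45);
    private static final long RUN_NANOS   = TimeUnit.MINUTES.toNanos(2);
    private static final long PAUSE_NANOS = 1000;

    private static final LatencyQuantizer QUANTIZER = new LatencyQuantizer();

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    work();
                }
            }, "worker-" + i);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    private static void work() {
        final long end = System.nanoTime() + RUN_NANOS;
        while (System.nanoTime() < end) {
            final long start = System.nanoTime();
            transaction(CALL_DEPTH);                         // the instrumented entry point
            QUANTIZER.record((System.nanoTime() - start) / 1000);
        }
    }

    // descends through a shallow call path before pausing briefly at the bottom
    private static void transaction(int depth) {
        if (depth <= 0) {
            term(PAUSE_NANOS);
        } else {
            transaction(depth - 1);
        }
    }

    // spin variant of the pause, as per the December update
    private static void term(long pauseNanos) {
        final long deadline = System.nanoTime() + pauseNanos;
        while (System.nanoTime() < deadline) {
            // busy spin
        }
    }
}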

Appendix B – Diagnostics
Jean-Philippe B., a software architect at Ullink, a provider of low-latency connectivity and trading solutions for the global financial community, pointed out on the Java Performance Professionals group on LinkedIn that the safepoint coordination protocol used by Hotspot was a contributor to the frequency increase in the higher latency bands.

“Call stack sampling uses ThreadDump. A ThreadDump involves reaching a safepoint and stop all threads during capture. It is like a GC but generally the pause time is in microseconds. Try running with -XX:+PrintGCApplicationStoppedTime. It will print not only GC pause time but also any VM operation involving safepoint and pause time. You will be surprised by the number of non GC pause time. To know from which VM operation the pause time is attributed add -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1”.

Here is a sample of the output from running the native JVMTI agent call stack sampler with the -XX:+PrintGCApplicationStoppedTime VM argument. It is printed in rapid succession during sampling. During a particular 2 minute run there were close to 11,000 of these lines logged, which tallies with the 10ms call stack sampling interval (120 seconds at one sample every 10ms gives roughly 12,000 safepoint pauses).

Total time for which application threads were stopped: 0.0002180 seconds
Total time for which application threads were stopped: 0.0002670 seconds
Total time for which application threads were stopped: 0.0002480 seconds
Total time for which application threads were stopped: 0.0002310 seconds
Total time for which application threads were stopped: 0.0002650 seconds
Total time for which application threads were stopped: 0.0003300 seconds
Total time for which application threads were stopped: 0.0002760 seconds
Total time for which application threads were stopped: 0.0004110 seconds
Total time for which application threads were stopped: 0.0005160 seconds
Total time for which application threads were stopped: 0.0004500 seconds
Total time for which application threads were stopped: 0.0003960 seconds
Total time for which application threads were stopped: 0.0003110 seconds

With the remote Thread/JMX call stack sampler the output includes GC timing.

[GC 20184K->1680K(70656K), 0.0025700 secs]
...
Total time for which application threads were stopped: 0.0003040 seconds
Total time for which application threads were stopped: 0.0002410 seconds
Total time for which application threads were stopped: 0.0003360 seconds
Total time for which application threads were stopped: 0.0002370 seconds
Total time for which application threads were stopped: 0.0002670 seconds
Total time for which application threads were stopped: 0.0002660 seconds
Total time for which application threads were stopped: 0.0002580 seconds
Total time for which application threads were stopped: 0.0003150 seconds
Total time for which application threads were stopped: 0.0003030 seconds
Total time for which application threads were stopped: 0.0003160 seconds
Total time for which application threads were stopped: 0.0003830 seconds
...
[GC 20112K->1832K(70656K), 0.0025650 secs]

Adding -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 to the command line prints out the following sample lines for the native JVMTI agent.

vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
17.806: GetAllStackTraces [ 13 0 0 ] [ 0 0 0 0 0 ] 0
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
17.816: GetAllStackTraces [ 13 0 0 ] [ 0 0 0 0 0 ] 0

For the Thread/JMX API call stack sampler the output is slightly different.

vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
19.837: ThreadDump [ 14 0 0 ] [ 0 0 0 0 0 ] 0
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
19.857: ThreadDump [ 14 0 0 ] [ 0 0 0 0 0 ] 0