Detecting hung threads in Java with call execution stack marking and tagging
The JVM (and probably many other languages & runtimes as well) sorely misses three very useful serviceability features with regard to call execution stacks:
- ids: a stack frame is given a unique (within the thread context) identifier prior to being pushed onto the call execution stack
- marking: recording the number of mark operations performed since a frame was pushed onto the call execution stack
- tagging: a stack frame will take either a tag value set at the thread level or process level prior to being pushed onto the call execution stack
With identifying, marking and tagging the job of detecting hung or delayed threads (deadlocked, busy or spinning) is made far easier because it is possible to easily determine whether a repetitive call stack trace across multiple dumps is indeed the same execution call path instance or an entirely new execution (for the same thread). This is made all the more problematic with thread pools and workers (code blocks) that follow the same flow but which operate on different data types not discernible from the code itself.
Fortunately it is possible to enrich a Java runtime with these capabilities using JXInsight/OpenCore’s dynamic instrumentation agent and extensible activity resource metering technology. It gets even better in that we can limit these enhancements to particular namespaces in the code base (that which is of most interest). This can be done remotely from within the JXInsight/OpenCore management console (@see operation toolbar) or locally using the Probes Open API which is accessible to the application as well as metering plugins.
To demonstrate the enrichments to the runtime I downloaded DataStax Cassandra distribution along with its Pricing Portfolio demo application.
Before launching the Cassandra server process on my Mac OS X I set the following environment variable.
export JVM_OPTS="-Xmx256M -agentpath:/OpenCore/bin/osx-32/libjxinsight.jnilib=prod -javaagent:/OpenCore/bundle/java/default/opencore-ext-aj-javaagent.jar"
In the root installation directory (from which the
bin/cassandra script was executed) I created an empty
jxinsight.aspectj.filters.config file since the instrumentation agent by default excludes
I then created a
jxinsight.override.config file with the following metering extensions enabled.
Here is the Stack Table view on startup of the server. There are 5 threads listed as currently executing incomplete metered method invocations. The Id column is the unique identifier for the frame within the context of the thread. The # column is the call path depth from a root caller perspective with
0 being the bottom of the stack. The Count column is the number of frames that have been pushed onto a threads execution call stack since a particular frame was pushed (this calculation is deferred until the collection point). The Count Î” column is the change in the Count column since the last mark. No mark had being performed at this stage. The last column, Î”, represents the number of marks that have occurred during the metering period of the frame.
Here is a snapshot taken after starting the
bin/pricer script that comes with the demo application with the parameters
-o UPDATE_PORTFOLIOS. Many new threads have been started and are busy executing. From the Count column we can see the amount of work, frames pushed (and popped), that has gone on underneath (within) a call frame’s execution (scope). The
Thrift-3 named thread has already executed 61,188 instrumented and metered method invocations.
The Count column tells us how busy a thread or frame has been in terms of executed instrumented methods but that it is not enough as we need to be able to see whether it changes over time on a per frame basis. This is what marking attempts to remedy. Below is a snapshot taken using the Refresh command following the execution of the Mark command and a pause of 1 second. Those frames with non-zero Count Î” values have called (directly or indirectly) other instrumented methods. Those with a value in the Î” column were pushed onto the execution stack prior to the mark and have remained on the stack since.
The next snapshot was taken following a Mark, a pause of 1 second, and a Reset which is a Mark and Refresh combined. From the table we can see that the
*.run methods have reached
3 marks. But we can also see in thread
Thrift-8 that called frames have been active prior to the Mark and up to (and possibly following) the Reset. We can now say that these frames have at minimum an execution time of 1 second. If we had paused for 10 seconds and had the similar results we could then claim a 10 second minimum execution time (delay).
Here is a snapshot following a Mark, pause, Mark, pause and Reset (Mark and Refresh). We don’t see any call frames other than the
*.run methods go above 3 marks.
Instead of marking to detect execution delay we can use tagging. I added the following system properties to the
jxinsight.override.config to enable tagging in the metering engine and to set the process tag to “
startup” before the very first frame is pushed.
Below is a snapshot taken following the completion of the server startup with tagging enabled. In the management console I then used the Tag command to clear this value for subsequent frames (and threads).
After starting the
bin/pricer script as above I set three tags, “
second” and “
third“, with a very slight pause between each setting. Those with a tag were pushed onto the stack when that particular tag was active (set).
Here is a snapshot following termination of the
bin/pricer script which generated the server workload. The number of threads and frames have shrunk back down but we also now know that thread
Thread-3 which was present from the start is now at a different execution flow point with the “
third” tag set.