Controlling Runaway Threads in the JVM using Resource Metering Quotas
Over the next year we are going to see a shift away from merely monitoring the performance of applications and runtimes towards actively managing it. That management involves profiling, protecting, policing, prioritizing and predicting, as well as some degree of provisioning, all largely automated by way of some form of local observation and feedback control.
In recent articles (Achieving Required Workload Levels in Performance Testing using QoS for Apps; From Anomaly Detection to Root Cause Analysis via Self Observation; Using System Dynamics for Effective Concurrency & Consumption Control of Code; Optimal Application Performance and Capacity Management via QoS for Apps) we have shown how control can be modeled and implemented using our unique Quality of Service (QoS) for Apps metering extension to JXInsight/OpenCore. There are other ways to achieve this, however, which may be more applicable when dealing with individual runaway threads, a problem normally addressed within our QoS technology using some form of call rate limiting.
Note: A “runaway” thread is a thread that continues to execute indefinitely, consuming resources including monitors, CPU and memory. In some cases this is caused by entering a loop that has no reachable terminating condition. It can also be caused by invalid input data that prolongs the execution well beyond the normal response time.
One option available is metering supervision, in which supervisory code routines are executed indirectly at various safe points in the execution. At these points the developer has explicitly allowed for supervision to occur and for a possible exception to be thrown, which the surrounding code (including possible callers) is designed to handle. But there are less intrusive and less destructive means of controlling and containing runaway threads: custom code metering interceptors and our new, entirely configuration-driven quota metering extension, both of which pause the execution of such threads.
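The supervision option can be illustrated with a minimal plain-Java sketch. The names here (Supervisor, checkpoint, BudgetExceededException) are hypothetical and are not part of JXInsight/OpenCore; the sketch only shows the shape of the idea: the developer marks safe points explicitly, and the caller is written to handle the exception thrown once a time budget is spent.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical supervisor, for illustration only; not a JXInsight/OpenCore class. */
final class Supervisor {
    /** Thrown at a safe point once the time budget has been spent. */
    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(String message) { super(message); }
    }

    private final long deadlineNanos;

    Supervisor(long budget, TimeUnit unit) {
        this.deadlineNanos = System.nanoTime() + unit.toNanos(budget);
    }

    /** Called by the developer at points where aborting the work is safe. */
    void checkpoint() {
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new BudgetExceededException("execution budget exceeded");
        }
    }
}

final class SupervisionDemo {
    /** Loops until the supervisor aborts it; returns the iterations completed. */
    static long spin(Supervisor supervisor) {
        long iterations = 0;
        try {
            while (true) {                 // no terminating condition: a runaway loop
                iterations++;
                supervisor.checkpoint();   // explicit safe point
            }
        } catch (Supervisor.BudgetExceededException e) {
            return iterations;             // the caller is designed to handle the abort
        }
    }

    public static void main(String[] args) {
        long n = spin(new Supervisor(20, TimeUnit.MILLISECONDS));
        System.out.println("aborted after " + n + " iterations");
    }
}
```

The cost of this style is visible in the sketch: the loop body has to be edited to add the checkpoint, and the work in progress is abandoned when the exception is thrown.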
Another option is metering interception in which the running metering total for a request execution starting from some entry point is checked at the completion of each metered method invocation against some quota. In the event the quota is exceeded the execution is delayed using a
Note: Though this approach actually prolongs the execution further, the rate of resource consumption is hopefully greatly diminished.
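Independent of any particular metering API, the interception idea can be sketched in plain Java. Everything in this sketch is an assumption made for illustration: the QuotaInterceptor class, its begin() and end() hooks, and the use of wall-clock time as the metered total are hypothetical and are not the JXInsight/OpenCore API. The running total starts at the outermost begin() on a thread and is checked against the quota each time a metered call completes.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical interceptor: illustrates the idea only, not the JXInsight/OpenCore API. */
final class QuotaInterceptor {
    /** Per-thread bookkeeping for the request currently executing. */
    private static final class State {
        int depth;        // nesting depth of metered calls
        long startNanos;  // when the entry point began
        int pauses;       // pauses applied to the current request
    }

    private final ThreadLocal<State> state = ThreadLocal.withInitial(State::new);
    private final long quotaNanos;
    private final long pauseMicros;

    QuotaInterceptor(long quotaMicros, long pauseMicros) {
        this.quotaNanos = TimeUnit.MICROSECONDS.toNanos(quotaMicros);
        this.pauseMicros = pauseMicros;
    }

    /** Invoke when a metered method begins. */
    void begin() {
        State s = state.get();
        if (s.depth++ == 0) {              // the outermost call is the entry point
            s.startNanos = System.nanoTime();
            s.pauses = 0;
        }
    }

    /** Invoke when a metered method completes; pauses the thread if over quota. */
    void end() {
        State s = state.get();
        s.depth--;
        if (System.nanoTime() - s.startNanos > quotaNanos) {
            s.pauses++;
            pause();
        }
    }

    /** Number of pauses applied to the calling thread's current (or last) request. */
    int pauses() { return state.get().pauses; }

    private void pause() {
        if (pauseMicros <= 0) { Thread.yield(); return; }
        try {
            TimeUnit.MICROSECONDS.sleep(pauseMicros);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
    }

    public static void main(String[] args) throws InterruptedException {
        QuotaInterceptor interceptor = new QuotaInterceptor(5_000, 1_000); // 5 ms quota, 1 ms pause
        interceptor.begin();               // entry point of the request
        for (int i = 0; i < 10; i++) {
            interceptor.begin();
            Thread.sleep(1);               // simulated work inside a metered method
            interceptor.end();
        }
        interceptor.end();
        System.out.println("pauses applied: " + interceptor.pauses());
    }
}
```

Keeping the state in a ThreadLocal means each thread is judged only on its own consumption, so one runaway request does not slow down well-behaved threads.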
Below is a basic implementation of such an approach using the interception metering extension in JXInsight/OpenCore. In the code an Interceptor is attached to every probe via the create(Name, Closure) method in the InterceptorFactory, which is called each time a new probe (method, statement, url,…) is created and before its firing within a thread context. This is irrespective of the source of the probe instrumentation, whether dynamic using our JVM agent or manual using our Open API explicitly.
Probe.Name is our method identifier. By not referring to the actual java.lang.reflect.Method, this same code can also execute in offline mode, though naturally you would report rather than control in such an environment.
Here is how the above code would be installed and configured within a metered Java runtime.
j.s.p is a shorthand version of jxinsight.server.probes recognized by our measurement runtimes.
In the latest update to JXInsight/OpenCore 6.3 we have added a built-in metering extension, quota, that eliminates the need for custom code and extends beyond the capabilities afforded by the interceptor metering extension, offering nesting and scoping of one or more quotas as well as the application of execution control at both the begin and end phases in the lifecycle of a metered probe (method).
To demonstrate the ease of use of the quota metering extension I have created a simple Java class that repeats a sequence of calls starting with $1() and ending with $9(), with each method on the call path executing a Thread.sleep(1) to mimic a millisecond of processing time.
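A class of the kind described might look like the following sketch. The class name, the nesting of the calls (each method calling the next after its sleep), the shared work() helper and the repeat count are all assumptions made for brevity; the text only fixes the method names $1() to $9() and the one millisecond sleep per method.

```java
/** Sketch of the demo workload; the nested call structure is an assumption. */
final class CallSequence {
    static int calls; // counts method entries, for illustration only

    /** Mimics one millisecond of processing time. */
    private static void work() {
        calls++;
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
    }

    // '$' is a legal character in Java identifiers, so $1 is a valid method name.
    static void $1() { work(); $2(); }
    static void $2() { work(); $3(); }
    static void $3() { work(); $4(); }
    static void $4() { work(); $5(); }
    static void $5() { work(); $6(); }
    static void $6() { work(); $7(); }
    static void $7() { work(); $8(); }
    static void $8() { work(); $9(); }
    static void $9() { work(); }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {    // repeat the call sequence
            $1();
        }
        System.out.println("method entries: " + calls);
    }
}
```

With this structure every method from $2() onwards is a direct or indirect invocation of $1(), which is what lets a single quota placed on $1() cover the whole sequence.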
Here is a snapshot of the metering model during an execution of the above code. It takes on average 12 milliseconds to execute the complete call sequence.
Let's give the execution of the $1() method, and its direct and indirect method invocations, a simple quota based on the clock.time meter with a threshold of 5000, which in the case of the clock.time meter (measured in microseconds) is 5 milliseconds. Once the execution within the scope of the $1() method has exceeded 5 milliseconds, which by default is evaluated at the entry point into called methods, a 1 millisecond pause (sleep) will be introduced.
Here is a new metering snapshot with the above configuration active. Notice the inherent (self)
clock.time average has changed for methods
$9(). Each of these methods has experienced a pause in execution of 1 millisecond (1,000 microseconds) for each execution because of the quota enforcement added at the
$1() with the quota being reset for each completion of the
Note: If the pause config value is set to 0 (the default) then a Thread.yield() call will be made.
The quota can be reset each and every time the quota is evaluated and the thread execution paused.
Now only the $6() method incurs an execution pause of 1 ms (1,000 microseconds).
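The effect of resetting the quota can be reasoned through with a small deterministic model that uses a virtual clock instead of real time. This is a sketch of the arithmetic only, not the quota extension itself: the QuotaModel class is hypothetical, the nested calls are flattened into a loop, and the per-method cost of 1,100 microseconds (a 1 ms sleep plus some overhead) is an assumed figure.

```java
import java.util.ArrayList;
import java.util.List;

/** Deterministic model of an entry-phase quota; hypothetical, not the quota extension. */
final class QuotaModel {
    static final long WORK_MICROS = 1_100;      // assumed cost of one method (sleep plus overhead)
    static final long THRESHOLD_MICROS = 5_000; // quota threshold in clock.time units
    static final long PAUSE_MICROS = 1_000;     // pause applied when the quota is exceeded

    /** Returns the names of the methods paused during one $1()..$9() pass. */
    static List<String> run(boolean resetOnPause) {
        List<String> paused = new ArrayList<>();
        long clock = 0;       // virtual clock in microseconds
        long scopeStart = 0;  // the quota scope opens when $1() begins
        for (int i = 1; i <= 9; i++) {
            // Evaluate the quota at method entry.
            if (clock - scopeStart > THRESHOLD_MICROS) {
                paused.add("$" + i + "()");
                clock += PAUSE_MICROS;
                if (resetOnPause) {
                    scopeStart = clock; // start counting again after the pause
                }
            }
            clock += WORK_MICROS;       // the method's own work
        }
        return paused;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [$6(), $7(), $8(), $9()]
        System.out.println(run(true));  // prints [$6()]
    }
}
```

In this model, without a reset the accumulated time stays above the threshold once it has been crossed, so every later method entry is paused. With a reset, the remaining three methods add only 3,300 microseconds before the sequence ends, so no further pause is triggered.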
The quota metering extension supports nesting of active quotas during call stack execution. In the following configuration a quota of 5,000 clock.time units is placed on the
$6() method with an even greater delay than the outer quota for the
In the metering model shown below the
$9() method experiences a 10 millisecond execution pause with the
$8() methods experiencing only a 1 millisecond delay because of quota evaluation passing back up the call stack.
The evaluation scope by default is stack but can be changed to top (of stack).
Now the only method impacted is
$9() because the
top scope stops the evaluation proceeding back up the stack beyond the
With the quota metering extension it is possible to evaluate quotas at the exiting of a method (ending of a metering).
This has resulted in the execution pause now occurring at the exiting (post metering) of
$7() and being attributed to
Note: The examples above use relatively low threshold values compared with what most web applications would need. Generally the values would be set much higher to lessen the chance of a false positive.
Note: This approach only works if the runaway thread continues to call methods which are instrumented and still being metered by JXInsight/OpenCore.