
Controlling Runaway Threads in the JVM using Resource Metering Quotas

Over the next year we are going to see a shift away from merely monitoring the performance of applications and runtimes to the active management of performance which involves profiling, protecting, policing, prioritizing and predicting as well as some degree of provisioning – all largely automated by way of some form of local observation and feedback control.

In recent articles (Achieving Required Workload Levels in Performance Testing using QoS for Apps; From Anomaly Detection to Root Cause Analysis via Self Observation; Using System Dynamics for Effective Concurrency & Consumption Control of Code; Optimal Application Performance and Capacity Management via QoS for Apps) we have shown how control can be modeled and implemented using our unique Quality of Service (QoS) for Apps metering extension to JXInsight/OpenCore. There are other ways to achieve this that may be more applicable when dealing with individual runaway threads, which our QoS technology would normally address using some form of call rate limiting.

Note: A “runaway” thread is a thread that continues to execute indefinitely, consuming resources including monitors, CPU and memory. In some cases this is caused by entering a loop that has no possible terminating condition. It can also occur when invalid input data drives and prolongs the execution beyond the normal response time.

Metering Supervision

One option available is metering supervision, in which supervisory code routines are executed indirectly at various safe points in the execution where the developer has explicitly allowed for supervision to occur and for a possible exception to be thrown, which the surrounding code (including possible callers) is designed to handle. But there are less intrusive and less destructive means of controlling and containing runaway threads: custom code metering interceptors and our new quota metering extension, which is entirely configuration driven. Both pause the execution of such threads rather than terminate it.

Metering Interception

Another option is metering interception in which the running metering total for a request execution starting from some entry point is checked at the completion of each metered method invocation against some quota. In the event the quota is exceeded the execution is delayed using a Thread.sleep(long) or Thread.yield() call.

Note: Though this approach actually prolongs the execution further, the rate of resource consumption is hopefully greatly diminished.
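The idea can be illustrated independently of any product API. The following is a minimal sketch in plain Java, assuming a hypothetical QuotaThrottle class whose begin() and end() methods are called around each metered method; none of these names come from the JXInsight/OpenCore interception API. It keeps a per-thread running total from the entry point and, once the total exceeds the quota, delays the thread with Thread.sleep(long) or Thread.yield() at the completion of each metered call.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a metering interceptor; not the JXInsight/OpenCore API. */
final class QuotaThrottle {
    private static final int DEPTH = 0, START = 1;

    private final long threshold;     // quota in meter units (microseconds for a clock-style meter)
    private final long pauseMillis;   // 0 means yield instead of sleep
    private final LongSupplier meter; // current meter reading for the calling thread

    // Per-thread state: call depth and the meter reading taken at the entry point.
    private final ThreadLocal<long[]> state = ThreadLocal.withInitial(() -> new long[2]);

    QuotaThrottle(long threshold, long pauseMillis, LongSupplier meter) {
        this.threshold = threshold;
        this.pauseMillis = pauseMillis;
        this.meter = meter;
    }

    /** Call when a metered method begins; the outermost call marks the entry point. */
    void begin() {
        long[] s = state.get();
        if (s[DEPTH]++ == 0) {
            s[START] = meter.getAsLong();
        }
    }

    /** Call when a metered method completes; returns true if the thread was delayed. */
    boolean end() {
        long[] s = state.get();
        s[DEPTH]--;
        long runningTotal = meter.getAsLong() - s[START];
        if (runningTotal <= threshold) {
            return false; // still within quota
        }
        if (pauseMillis > 0) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
            }
        } else {
            Thread.yield();
        }
        return true;
    }
}
```

A throttle created as new QuotaThrottle(1_000_000, 1, () -> System.nanoTime() / 1_000) would approximate a one second quota on a microsecond clock, but how the begin and end calls get wired into real instrumentation is left open here.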

Below is a basic implementation of such an approach using the interception metering extension in JXInsight/OpenCore. In the code an Interceptor is attached to every probe via the create(Name, Closure) method in the InterceptorFactory which is called each time a new probe (method, statement, url,…) is created and before its firing within a thread context. This is irrespective of the source of the probe instrumentation, whether it is dynamic using our JVM agent or manual using our Open API explicitly.

Note: The Probe.Name is our method identifier. By not referring to the actual java.lang.Class or java.lang.reflect.Method this same code can actually execute in offline mode though naturally you would report rather than control in such an environment.

Here is how the above code would be installed and configured within a metered Java runtime.

j.s.p.interceptor.enabled=true
j.s.p.interceptors=y
j.s.p.interceptor.filter.enabled=true
j.s.p.interceptor.filter.exclude.name.groups=com.acme.Client.main
j.s.p.interceptor.y.factory.class=com.acme.opencore.InterceptorFactory
j.s.p.interceptor.y.environment.imports=t,m
j.s.p.interceptor.y.environment.import.t.name=quota.threshold
j.s.p.interceptor.y.environment.import.t.value=1000000
j.s.p.interceptor.y.environment.import.t.type=int
j.s.p.interceptor.y.environment.import.m.name=quota.meter
j.s.p.interceptor.y.environment.import.m.value=clock.time
j.s.p.interceptor.y.environment.import.m.type=name

Note: j.s.p is a short hand version of jxinsight.server.probes recognized by our measurement runtimes.

Metering Quotas

In the latest update to JXInsight/OpenCore 6.3 we have added a built-in metering extension, quota, that eliminates the need for custom code. It goes beyond the capabilities of the interceptor metering extension by offering nesting and scoping of one or more quotas, as well as execution control at both the begin and end phases in the lifecycle of a metered probe (method).

To demonstrate the ease of use of the quota metering extension I have created a simple Java class that repeats a sequence of calls starting with $1() and ending with $9() with each method on the call path executing a Thread.sleep(1) mimicking a millisecond of processing time.
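The original listing is not reproduced in this text, so the following is only a reconstruction from the description. It assumes the class is the com.acme.Client named in the configuration, that each method sleeps before calling the next one, and that main(String[]) loops a fixed number of times; the iteration count and its command-line argument are made up for illustration.

```java
// package com.acme;  // assumed from the com.acme.Client names used in the configuration

final class Client {

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical iteration count; the original only says the sequence is repeated.
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        for (int i = 0; i < iterations; i++) {
            $1();
        }
    }

    // Each method mimics one millisecond of processing, then calls the next in the chain.
    static void $1() throws InterruptedException { Thread.sleep(1); $2(); }
    static void $2() throws InterruptedException { Thread.sleep(1); $3(); }
    static void $3() throws InterruptedException { Thread.sleep(1); $4(); }
    static void $4() throws InterruptedException { Thread.sleep(1); $5(); }
    static void $5() throws InterruptedException { Thread.sleep(1); $6(); }
    static void $6() throws InterruptedException { Thread.sleep(1); $7(); }
    static void $7() throws InterruptedException { Thread.sleep(1); $8(); }
    static void $8() throws InterruptedException { Thread.sleep(1); $9(); }
    static void $9() throws InterruptedException { Thread.sleep(1); }
}
```

Whether the sleep comes before or after the nested call changes when a quota is first exceeded, so measurements taken with this sketch may not line up exactly with the figures quoted in this article.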

Here is a snapshot of the metering model during an execution of the above code. It takes on average 12 milliseconds to execute the complete call sequence.

Let's give the execution of the $1() method, and its direct and indirect method invocations, a simple quota based on the clock.time meter with a threshold of 5000, which in the case of the clock.time meter (measured in microseconds) is 5 milliseconds. Once the execution within the scope of the $1() method has exceeded 5 milliseconds, which by default is evaluated at the entry point into called methods, a 1 millisecond pause (sleep) will be introduced.

j.s.p.quota.enabled=true
j.s.p.quotas=one
j.s.p.quota.one.name.groups=com.acme.Client.$1
j.s.p.quota.one.meter=clock.time
j.s.p.quota.one.threshold=5000
j.s.p.quota.one.pause=1

Here is a new metering snapshot with the above configuration active. Notice the inherent (self) clock.time average has changed for methods $5(), $6(), $7(), $8() and $9(). Each of these methods has experienced a pause of 1 millisecond (1,000 microseconds) on each execution because of the quota enforced at $1(), with the quota being reset on each completion of the $1() method.

Note: If the pause config value is set to 0 (the default) then a Thread.yield() call will be made.
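The arithmetic behind this result can be traced with a small simulation on a virtual clock. This is a sketch under stated assumptions, not the extension's actual evaluation logic: it assumes each method costs a hypothetical 1,333 microseconds (12 milliseconds spread over nine methods), that a method does its own work before calling the next, and that a pause triggered at a method's entry is attributed to that method.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotaSimulation {

    /** Returns the methods whose entry was paused during one simulated $1()..$9() pass. */
    static List<String> pausedMethods(long costMicros, long thresholdMicros, long pauseMicros) {
        List<String> paused = new ArrayList<>();
        long clock = 0; // virtual clock.time in microseconds since $1() began
        for (int n = 1; n <= 9; n++) {
            // Begin-phase evaluation: compare time spent in the scope of $1() with the quota.
            if (clock > thresholdMicros) {
                clock += pauseMicros; // the pause itself also consumes clock time
                paused.add("$" + n);
            }
            clock += costMicros; // the method's own work before it calls the next one
        }
        return paused;
    }

    public static void main(String[] args) {
        // Threshold 5000 and pause of 1 ms (1,000 microseconds) as in the configuration above.
        System.out.println(pausedMethods(1_333, 5_000, 1_000)); // prints [$5, $6, $7, $8, $9]
    }
}
```

Under these assumptions the quota is first exceeded at the entry into $5(), at 5,332 microseconds, and every later entry is paused as well. With a different per-method cost the first paused method shifts, which is why real measurements can vary from run to run.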

The quota can also be reset each time it is evaluated and the thread execution is paused, rather than only on completion of the quota's method.

j.s.p.quota.evaluate.reset.enabled=true

Now only the $6() method incurs an execution pause of 1 ms (1,000 microseconds).

The quota metering extension supports nesting of active quotas during call stack execution. In the following configuration a quota of 5,000 clock.time units is placed on the $6() method with an even greater delay than the outer quota for the $1() method.

j.s.p.quota.enabled=true
j.s.p.quotas=one,six
j.s.p.quota.one.name.groups=com.acme.Client.$1
j.s.p.quota.one.meter=clock.time
j.s.p.quota.one.threshold=5000
j.s.p.quota.one.pause=1
j.s.p.quota.six.name.groups=com.acme.Client.$6
j.s.p.quota.six.meter=clock.time
j.s.p.quota.six.threshold=5000
j.s.p.quota.six.pause=10

In the metering model shown below the $9() method experiences a 10 millisecond execution pause with the $6(), $7() and $8() methods experiencing only a 1 millisecond delay because of quota evaluation passing back up the call stack.

The evaluation scope by default is stack but can be changed to top (of stack).

j.s.p.quota.enabled=true
j.s.p.quotas=one,six
j.s.p.quota.one.name.groups=com.acme.Client.$1
j.s.p.quota.one.meter=clock.time
j.s.p.quota.one.threshold=5000
j.s.p.quota.one.pause=1
j.s.p.quota.six.name.groups=com.acme.Client.$6
j.s.p.quota.six.meter=clock.time
j.s.p.quota.six.threshold=3000
j.s.p.quota.six.pause=10
j.s.p.quota.evaluate.scope=top

Now the only method impacted is $9() because the top scope stops the evaluation proceeding back up the stack beyond the $6() method.

With the quota metering extension it is possible to evaluate quotas at the exiting of a method (ending of a metering).

j.s.p.quota.evaluate.begin.enabled=false
j.s.p.quota.evaluate.end.enabled=true

This has resulted in the execution pause now occurring at the exiting (post metering) of $9(), $8(), and $7() and being attributed to $8(), $7() and $6().

Note: The examples above use threshold values that are relatively low for most web applications. Generally the values would be set much higher to lessen the chance of a false positive.

Note: This approach only works if the runaway thread continues to call methods which are instrumented and still being metered by JXInsight/OpenCore.