Canceling Uncontrolled Jitter with Adaptively Controlled Jitter
Today we released a new adaptive control valve metering extension, jvalve, that self-regulates the execution of probes (methods) associated with an adaptive control valve when runtime performance jitter is observed. Surprisingly, this type of adaptive control handles jitter by introducing delay (thread suspension), which to a client can itself appear as jitter, though deliberate and controlled. In doing so, the adaptive control valve minimizes the chance of uncontrolled jitter continuing to plague processing performance within a metered server runtime. The most common causes of performance jitter are excessive garbage collection, as a result of high allocation rates, and highly concurrent work processing, so in general any temporary suspension of processing until the jitter has dissipated can help speed up the return to normal processing performance levels.
You might ask how this differs from the typical stop-the-world garbage collection events within an application runtime, when all application threads are suspended. Why not allow garbage collection to “naturally” throttle the processing? For one thing, improved performance, which we already demonstrated in Adaptively Controlling Apache Cassandra Client Request Processing. What the jvalve metering extension additionally brings to the table is the ability to be selective about which threads are suspended, by way of pending invocation targets, during the early warning signs of a possible performance incident. Valves allow us to temporarily suspend any further invocation of particular methods (and their call paths). With each valve we can associate one or more probes (methods) and set the sensitivity of the valve to jitter, via the run control set points, offering a kind of ordering (reverse preference) to the suspension of thread processing. By making method x, and its associated valve, more sensitive than method y, and its associated valve, threads calling into method x will more than likely experience controlled delay before threads calling into method y. In fact, what we would like to see is that throttling method x alleviates further performance jitter, eliminating the need to suspend threads calling into method y.
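To make the ordering idea concrete, here is a minimal sketch in Java. It is not the jvalve API: the class name, the `Valve` type, and the use of a single jitter threshold in milliseconds as the "sensitivity" are all assumptions made for illustration. It only shows that a valve with a lower threshold starts suspending its callers at a lower level of observed jitter than a valve with a higher one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of ordering valves by sensitivity; not the jvalve API. */
final class ValveOrderingSketch {

    /** A valve guarding one method, with an assumed jitter threshold in milliseconds. */
    static final class Valve {
        final String method;
        final long thresholdMillis;

        Valve(String method, long thresholdMillis) {
            this.method = method;
            this.thresholdMillis = thresholdMillis;
        }

        /** A lower threshold means a more sensitive valve. */
        boolean trips(long observedJitterMillis) {
            return observedJitterMillis > thresholdMillis;
        }
    }

    /** Returns the methods whose callers would be suspended at the given jitter level. */
    static List<String> suspended(List<Valve> valves, long observedJitterMillis) {
        List<String> result = new ArrayList<>();
        for (Valve valve : valves) {
            if (valve.trips(observedJitterMillis)) {
                result.add(valve.method);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Method x is guarded by the more sensitive valve (assumed values).
        List<Valve> valves = List.of(new Valve("x", 5), new Valve("y", 20));
        for (long jitter : new long[] {2, 10, 50}) {
            System.out.println(jitter + " ms -> suspended: " + suspended(valves, jitter));
        }
    }
}
```

At moderate jitter only callers of x are delayed; callers of y are affected only if the jitter keeps growing despite x being throttled.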
The jvalve metering extension is similar to the svalve metering extension in that it temporarily suspends processing on some runtime condition. It differs from svalve in that the suspension is lifted immediately once jitter is no longer observed, without any slow ramp-up of the suspended workload. This allows it to be far more sensitive to jitter observations, though there is an automatic adjustment built in to prevent disruptive oscillation.
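The "suspend, then lift immediately" behavior can be sketched with a plain Java monitor. This is an illustrative assumption about how such a gate could work, not the jvalve implementation: `JitterGateSketch`, `setJitterObserved`, `enter`, and `isOpen` are hypothetical names, and the jitter signal is supplied by the caller rather than measured.

```java
/** Hypothetical gate that suspends callers while jitter is observed; not the jvalve implementation. */
final class JitterGateSketch {
    private final Object lock = new Object();
    private boolean jitterObserved; // guarded by lock

    /** Called by some (assumed) monitor whenever its jitter observation changes. */
    void setJitterObserved(boolean observed) {
        synchronized (lock) {
            jitterObserved = observed;
            if (!observed) {
                lock.notifyAll(); // release every suspended caller at once, with no ramp-up
            }
        }
    }

    /** True when callers would pass through without being suspended. */
    boolean isOpen() {
        synchronized (lock) {
            return !jitterObserved;
        }
    }

    /** Called on entry to a guarded method; blocks while jitter is observed. Returns nanoseconds waited. */
    long enter() throws InterruptedException {
        long start = System.nanoTime();
        synchronized (lock) {
            while (jitterObserved) {
                lock.wait();
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        JitterGateSketch gate = new JitterGateSketch();
        gate.setJitterObserved(true);
        Thread caller = new Thread(() -> {
            try {
                long waitedNanos = gate.enter();
                System.out.println("caller resumed after " + waitedNanos / 1_000_000 + " ms");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        caller.start();
        Thread.sleep(100);             // jitter persists for a while
        gate.setJitterObserved(false); // suspension is lifted immediately
        caller.join();
    }
}
```

A gradual ramp-up, by contrast, would release waiting threads a few at a time instead of calling `notifyAll`.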
Below is the CPU profile from a client machine running an extreme insert stress benchmark against a remote metered Apache Cassandra 1.2.5 server that has no adaptive control in place. The regular dips in CPU utilization across all cores are a result of memory management and table compaction. The frequency of the dips levels off slightly at the end of the run as client processes (+20) are terminated.
We can attach a “jitter” valve to the primary processing entry point by adding the following to a
Without changing the default control set points associated with the pf adaptive control valve, there is already a noticeable difference in the frequency, length, and clustering of severe performance jitters. This is achieved without any code changes.
By default the “jitter” control valve will automatically adjust the tolerance in line with normal behavior. This is generally a good idea, but under continued stress, as in a benchmark, it can cause the valve to perceive “normal” incorrectly. This mechanism can be disabled with the following system property.
With the automatic adjustment of the tolerance disabled, thereby increasing the sensitivity of the valve, the benchmark ended much quicker.
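Why automatic adjustment can dull the valve under sustained stress is easy to see in a toy model. The sketch below is an assumption, not how jvalve computes its tolerance: it tracks "normal" as an exponential moving average of observed jitter and trips when an observation exceeds that baseline by a fixed factor. The class name, the smoothing factor, and the factor of 1.5 are all hypothetical.

```java
/** Hypothetical model of a self-adjusting jitter tolerance; not how jvalve computes it. */
final class AdaptiveToleranceSketch {
    private static final double TRIP_FACTOR = 1.5; // assumed margin above "normal"

    private final boolean autoAdjust;
    private final double alpha; // assumed smoothing factor for the moving average
    private double baselineMillis;

    AdaptiveToleranceSketch(double initialBaselineMillis, double alpha, boolean autoAdjust) {
        this.baselineMillis = initialBaselineMillis;
        this.alpha = alpha;
        this.autoAdjust = autoAdjust;
    }

    /** Feeds one jitter observation; returns true if the valve would suspend callers. */
    boolean observe(double jitterMillis) {
        boolean trips = jitterMillis > baselineMillis * TRIP_FACTOR;
        if (autoAdjust) {
            // Move the notion of "normal" toward what is currently being observed.
            baselineMillis += alpha * (jitterMillis - baselineMillis);
        }
        return trips;
    }

    /** Counts how often the valve trips during a run of constant, elevated jitter. */
    static int countTrips(boolean autoAdjust) {
        AdaptiveToleranceSketch valve = new AdaptiveToleranceSketch(10.0, 0.2, autoAdjust);
        int trips = 0;
        for (int i = 0; i < 50; i++) {
            if (valve.observe(40.0)) {
                trips++;
            }
        }
        return trips;
    }

    public static void main(String[] args) {
        System.out.println("trips with auto-adjust:    " + countTrips(true));
        System.out.println("trips without auto-adjust: " + countTrips(false));
    }
}
```

In this model the self-adjusting valve stops tripping after the first few observations because the sustained stress has become its new "normal", while the fixed valve keeps intervening for the whole run.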
Sometimes to speed things up you need to slow things down. Software can experience its own form of fatigue. Early detection and intervention is far more efficient and can provide increased operational stability. In our experience with high-performance, large-scale systems, smaller and less intrusive interventions greatly improve the resilience and availability of systems.