JXInsight/OpenCore 6.4.EA.11 Released – Adaptive “Safety” Control
“Resilience is the ability to absorb disturbances, to be changed and then to re-organise and still have the same identity (retain the same basic structure and ways of functioning). It includes the ability to learn from the disturbance. A resilient system is forgiving of external shocks. As resilience declines the magnitude of a shock from which it cannot recover gets smaller and smaller. Resilience shifts attention from purely growth and efficiency to needed recovery and flexibility…Systems are never static, they tend to move through four, recurring phases, known as an adaptive cycle. Generally, the pattern of change is a sequence from a rapid growth phase through to a conservation phase in which resources are increasingly unavailable, locked up in existing structures, followed by a release phase that quickly moves into a phase of reorganisation, and thence into another growth phase.” – Resilience Alliance
The eleventh early access build of JXInsight/OpenCore 6.4 “ACE” has been published on our developer site. This release includes a new adaptive control in execution (ACE) metering extension, svalve, that self-regulates the concurrency and execution of probes (methods) associated with a “safety” valve when a specified number of incomplete (stalled) probes exceed a duration threshold.
Adaptive Control Valves versus Circuit Breakers
- Adaptive control valves are dynamically added into code at runtime. It is not necessary to explicitly call into an API and introduce a compile time dependency, though OpenCore does offer a Probes Open API for contextual activity/task execution metering, which the svalve extension uses to hook adaptive control into the runtime. No command framework or actor programming model with software circuit breaker capabilities needs to be adopted pervasively (and perversely). Safety valves can be automatically introduced into any method that is metered. Valves can be specified at deployment time via externalized configuration, whereas circuit breaker APIs require hardwiring at development time.
- Adaptive control valves are preemptive and preventive whereas circuit breakers are reactive. Valves attempt to detect problems before a failure is truly observed. Once triggered, valves mitigate further risk by temporarily suspending processing and then slowly reintroducing workload (valve flow). Circuit breakers react after a failure and don’t actually attempt to manage the software execution other than to abort abruptly.
- Adaptive control valves are protective, providing a window in which the (predicted) outcome can be altered via self-healing and risk mitigation strategies. Circuit breakers on the other hand simply fail fast, replacing one failure type with another, which can increase risk elsewhere in the caller chain and further choke performance, especially when the underlying predicate is incorrect, the costs in exception handling and propagation are high, and the resulting system-wide dynamics are not fully understood.
- Adaptive control valves observe both good (normal) and bad (abnormal) behavior. Control valves can learn to predict transitions from one to another. Valves are able to assess the effectiveness of their predictions and abort intervention early if invalidated. Circuit breakers generally only receive post-failure notifications.
- Adaptive control valves can be applied at the entry point into a web application or service. Doing so ensures that subsequent resource consumption only occurs when the request is likely to succeed. Being at the entry point also means the adaptive control spans multiple backend resources and external services. Circuit breakers are instead applied at the point of interaction with each backend resource or external service, so request processing proceeds blindly, only to fail as soon as it hits a triggered circuit breaker, which forcefully terminates it by way of a thrown exception.
- Adaptive control valves can be applied to safeguard the application from its own problematic workload patterns that cause excessive resource contention or garbage collection cycles. A key trait of resilient systems is immediate and effective action in the event of anticipated (predicted) failure. Whilst our rvalve adaptive control extensions are perfect at finding optimal levels of concurrency under normal operating conditions, their hill climbing (and descent) algorithms mean there is a time lag in adjustment as each new level is assessed. The svalve adaptive control can be used to compensate by immediately dropping down through all levels. Circuit breakers are not designed for such usage because they do not model or manage concurrency.
Adaptive control valves can’t help when an external service or resource becomes unavailable and remains in this state for a protracted period. In such cases the best course of action is to fire a signal to stimulate software adaptation further upstream. Throwing an exception is an invasive and destructive form of signaling. Ideally any sort of “testing the waters” to determine whether a suspended service has become available again should be done separately from the request processing path by way of availability probes or (callback) notifications.
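One way to keep the “testing the waters” off the request path is a background availability probe. The sketch below is illustrative only and is not part of OpenCore: the class name AvailabilityProbe, its methods, and the health check supplied by the caller are all hypothetical. A scheduler thread re-checks the service and publishes the result in a flag that request threads merely read.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not an OpenCore API.
final class AvailabilityProbe implements AutoCloseable {
    private final AtomicBoolean available = new AtomicBoolean(true);
    private final BooleanSupplier healthCheck; // caller-supplied check, e.g. a ping
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "availability-probe");
            t.setDaemon(true); // never keeps the JVM alive
            return t;
        });

    AvailabilityProbe(BooleanSupplier healthCheck) {
        this.healthCheck = healthCheck;
    }

    /** Starts periodic checks on the background thread. */
    void start(long periodMillis) {
        scheduler.scheduleWithFixedDelay(this::checkNow, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs one check; a failing check marks the service as unavailable. */
    void checkNow() {
        boolean up;
        try {
            up = healthCheck.getAsBoolean();
        } catch (RuntimeException e) {
            up = false;
        }
        available.set(up);
    }

    /** Cheap read for request threads; no call to the service is made here. */
    boolean isAvailable() {
        return available.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A request thread would consult isAvailable() before committing resources and raise its upstream signal when the flag is false, rather than discovering the outage through an exception thrown deep in the call chain.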
Software Resilience Example
The following code will be used to demonstrate the application of a “safety” control valve. The code creates 50 threads, each a second apart, that loop continuously calling the samples.Surge.process method, which in turn calls samples.Surge.resource without any delay. The resource method has been coded to simulate excessive resource service delay under stress as the degree of concurrent call execution increases.
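A minimal sketch of a workload matching this description is given here. It is a reconstruction under stated assumptions and not the original listing: the permit count of 4, the delay formula (one millisecond per concurrent caller), and the bounded run duration are all invented for illustration, and the package declaration is omitted so the file compiles on its own.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class Surge {
    static final Semaphore PERMITS = new Semaphore(4);       // assumed pool size
    static final AtomicInteger ACTIVE = new AtomicInteger(); // callers inside resource()
    static final AtomicLong CALLS = new AtomicLong();        // completed calls

    /** Does nothing other than call resource(), as in the description. */
    static void process() throws InterruptedException {
        resource();
    }

    /** Simulated service whose delay grows with the number of concurrent callers. */
    static void resource() throws InterruptedException {
        int concurrent = ACTIVE.incrementAndGet();
        try {
            // Repeatedly attempt to acquire a permit; this is where contention builds.
            while (!PERMITS.tryAcquire(1, TimeUnit.MILLISECONDS)) {
                // retry
            }
            try {
                Thread.sleep(concurrent); // assumed: 1 ms of service time per concurrent caller
                CALLS.incrementAndGet();
            } finally {
                PERMITS.release();
            }
        } finally {
            ACTIVE.decrementAndGet();
        }
    }

    /** Starts the threads rampMillis apart; each loops until runMillis has elapsed. */
    static long run(int threads, long rampMillis, long runMillis) throws InterruptedException {
        CALLS.set(0);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(runMillis);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (System.nanoTime() < deadline) {
                        process();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
            Thread.sleep(rampMillis);
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return CALLS.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 50 threads, one second apart; the run is bounded here so the sketch terminates.
        System.out.println(run(50, 1_000, 60_000) + " calls completed");
    }
}
```

The run is bounded only so that the sketch finishes; the code described above loops continuously.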
Below is the metered “powers of two” distribution for the process method. Note that the extreme outliers are caused by contention for permits, which are repeatedly attempted to be acquired within the resource method.
A “safety” control valve periodically checks the progress of associated probes and temporarily suspends the execution of all subsequent associated probe firings (threads/methods) when it detects potentially stalled probes (threads/methods). The suspension of execution remains in place up to a specified time though it can be prematurely lifted in the event of all associated probes, including those not deemed stalled, completing. Here is how a “safety” valve can be defined for the code shown above.
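Independently of how the valve is declared in OpenCore, the behaviour described in this paragraph can be approximated in plain Java. The sketch below is not the svalve implementation or its configuration syntax: the class SafetyValveSketch and all of its method and parameter names are hypothetical. Time is passed in explicitly so the logic can be followed (and tested) without real waiting, and the blocking of callers while the valve is closed is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the described behaviour, not OpenCore code.
final class SafetyValveSketch {
    private final long stalledMillis;  // age at which an in-flight probe counts as stalled
    private final int minStalled;      // stalled probes needed before the valve closes
    private final long suspendMillis;  // longest time the valve stays closed
    private final Map<Long, Long> inFlight = new HashMap<>(); // probe id -> start time
    private long nextId;
    private long closedUntil = Long.MIN_VALUE;

    SafetyValveSketch(long stalledMillis, int minStalled, long suspendMillis) {
        this.stalledMillis = stalledMillis;
        this.minStalled = minStalled;
        this.suspendMillis = suspendMillis;
    }

    /** Records a probe that started at time now and returns its id. */
    synchronized long begin(long now) {
        long id = nextId++;
        inFlight.put(id, now);
        return id;
    }

    /** Records completion; the suspension is lifted early once nothing is in flight. */
    synchronized void end(long id) {
        inFlight.remove(id);
        if (inFlight.isEmpty()) {
            closedUntil = Long.MIN_VALUE;
        }
    }

    /** Periodic inspection: closes the valve when enough probes look stalled. */
    synchronized void inspect(long now) {
        long stalled = inFlight.values().stream()
            .filter(start -> now - start >= stalledMillis)
            .count();
        if (stalled >= minStalled && !isClosed(now)) {
            closedUntil = now + suspendMillis;
        }
    }

    /** True while subsequent probe firings should be held back. */
    synchronized boolean isClosed(long now) {
        return now < closedUntil;
    }
}
```

In a real deployment a timer would call inspect at the inspection interval, and a thread entering a metered method would wait while isClosed returns true.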
Rerunning the code with the svalve probes provider enabled and configured results in a very different metering distribution. The extreme outliers have been removed, though a cluster has appeared in the 1 sec to 2 sec and 2 sec to 4 sec latency bands. This cluster can be attributed to the suspension of execution by the control valve (absorption) and the subsequent slow incremental workload increase (release/relief). Both the suspend and sustain states have a default interval of 2 secs.
The suspension exerted by the “safety” control valve is evident in the inherent total for the process method. Though the process method does nothing other than call the resource method, it had an average self clock time of 650 ms with the safety valve dynamically installed at runtime. As delay (pressure) increases in the resource method, the valve starts to temporarily suspend and throttle execution calling into the
Below is the inherent total (injected delay) for all threads plotted (1 sec apart) over the course of the run as the number of threads is increased.
For each “safety” valve defined a number of properties can be set. Here is the list along with defaults.
# the time interval (in ms) between "safety" inspections and measurements
# the duration (in ms) for a thread to be deemed stalled for a "safety" control valve
# the suspend period (in ms) for a closed "safety" control valve
# the suspend period (in ms) for a closed "safety" control valve
# the minimum no. of stalled threads for the "safety" control valve to be closed
# the percentage of stalled threads to active threads for the "safety" control valve to be closed
# the maximum wait time (in ms) for a queued thread in a closed "safety" control valve
# delays (in ms) the intervention of the adaptive "safety" control valve
# enables (or disables) fair (fifo) queuing of threads entering the "safety" control valve
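To show how the two closing thresholds and the intervention delay might interact, here is a small sketch. It rests on assumptions the list above does not confirm: that both the minimum count and the percentage must be met before the valve closes, and that the delay is measured from the moment the valve is installed. The class ValveSettings and its field names are hypothetical and are not OpenCore property names.

```java
// Hypothetical settings holder, not an OpenCore class.
final class ValveSettings {
    final int minStalled;     // minimum no. of stalled threads
    final int stalledPercent; // stalled threads as a percentage of active threads
    final long delayMillis;   // assumed: grace period after installation with no intervention

    ValveSettings(int minStalled, int stalledPercent, long delayMillis) {
        this.minStalled = minStalled;
        this.stalledPercent = stalledPercent;
        this.delayMillis = delayMillis;
    }

    /** Assumed rule: close only when count and percentage thresholds are both met. */
    boolean shouldClose(int stalled, int active, long uptimeMillis) {
        if (uptimeMillis < delayMillis) {
            return false; // intervention is still delayed
        }
        if (active <= 0 || stalled < minStalled) {
            return false;
        }
        // Integer form of stalled / active >= stalledPercent / 100.
        return stalled * 100L >= (long) stalledPercent * active;
    }
}
```

With minStalled of 2 and stalledPercent of 25, two stalled threads out of eight active ones would close the valve under these assumptions, while two out of nine would not.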