Going Beyond Actor Process Supervision with Simulation and Signaling
“Nature, under my supervision, creates all animate and inanimate objects; and thus the creation keeps on going” – Bhagavad Gita
In Simz — The Promise of Near Real-Time Application Simulation in the Cloud I described how Simz can create something out of nothing when no process exists for an application that is dynamically managed in the cloud. The same article also showed how Simz can make multiple application runtimes look like a single one. But in the early inception of Simz we had another equally powerful use case in mind and that was to have a runtime simulation for every actual application runtime.
Simz in this context resembles a (black box) flight recorder, but instead of storing sampled sensor data it replays, in step with the real runtime, the metered execution behavior that the runtime records and transmits over a live metering feed. If the application runtime suddenly terminates and the feed connection closes, we can still acquire runtime diagnostics from the simulation. More importantly, the stack of each simulated thread will reflect what was being executed, but had not completed, at the moment of the abrupt termination. This also addresses the problem of diagnostic routines built into the runtime, and called via shutdown hooks, not being able to execute before the process has completely ended.
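To make the flight recorder idea concrete, here is a minimal sketch in Python. It is not the Simz implementation: the event shape (kind, thread, probe), the class name SimulatedRuntime and the probe names are all assumptions made for illustration. It only shows how a simulation that mirrors begin and end events still holds the unfinished calls after the feed stops.

```python
from collections import defaultdict


class SimulatedRuntime:
    """Mirrors per-thread call stacks from a stream of metering events.

    Hypothetical illustration only; the event tuple layout is assumed.
    """

    def __init__(self):
        # thread name -> stack of probe names currently executing
        self.stacks = defaultdict(list)

    def apply(self, event):
        kind, thread, probe = event
        if kind == "begin":
            self.stacks[thread].append(probe)
        elif kind == "end":
            # An "end" closes the most recent matching "begin" on that thread.
            stack = self.stacks[thread]
            if stack and stack[-1] == probe:
                stack.pop()

    def in_flight(self):
        """Return the calls that were started but never completed."""
        return {t: list(s) for t, s in self.stacks.items() if s}


# Assumed feed from a worker that dies in the middle of Db.query.
feed = [
    ("begin", "main", "Server.handle"),
    ("begin", "main", "Db.query"),
    ("begin", "worker-1", "Cache.load"),
    ("end", "worker-1", "Cache.load"),
    # the connection closes here; no further events arrive
]

sim = SimulatedRuntime()
for event in feed:
    sim.apply(event)

print(sim.in_flight())  # {'main': ['Server.handle', 'Db.query']}
```

The simulation needs no cooperation from the dying process: whatever was open when the feed went silent is what remains on the simulated stacks.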
We can take this further by incorporating elements of behavioral supervision and behavioral memorization.
Process-level supervision is a familiar concept for those working with actor-based languages and runtimes such as Erlang/OTP.
“In general application programming, robust server deployments include an external “nanny” that will monitor the running operating-system process and restart it if it fails. The restarted process reinitializes itself by reading its persistent state from disk and then resumes running. Any pending operations and volatile state will be lost, but assuming that the persistent state isn’t irreparably corrupted, the service can resume. The Erlang version of a nanny is the supervisor behaviour. A supervisor process spawns a set of child processes and links to them so it will be informed if they fail. A supervisor uses an initialization callback to specify a strategy and a list of child specifications. A child specification gives instructions on how to launch a new child. The strategy tells the supervisor what to do if one of its children dies: restart that child, restart all children, or several other possibilities. If the child died from a persistent condition rather than a bad command or a rare heisenbug, then the restarted child will just fail again.” – Erlang for Concurrent Programming
This all sounds good until you realize that supervision in Erlang, as in other actor systems such as Scala/Akka, is largely focused on process lifecycle management and not on the internal workings of the worker process (or actor). The supervisor does not monitor, and is not aware of, the software execution behavior of the worker process beyond whether it has terminated and the cause of that termination. It is not at all like a metering supervisor.
Simz, on the other hand, pairs the application with a simulation runtime that is, for all intents and purposes, the application under supervision. The supervisor is the worker in terms of metered execution and resource consumption tracking. The simulated runtime assumes the behavior patterns of the real application runtime, and thus supervision is self-reflective. Instead of asking the question “What is the worker process doing?” it asks “What am I doing?”. In this context you can think of Simz as the “brain in a vat”. Its perception of reality comes via the metering feed. Here the worker takes the role of the body, of which there can be many over time.
In Erlang the supervisor is responsible for starting, stopping and monitoring its child processes. The basic idea of a supervisor is that it should keep its child processes alive by restarting them when necessary. Life and death, but nothing in between. Simz today can support this level of supervision via a metering plugin that terminates (or restarts) a connected worker runtime when it detects unusual behavior, and not just at the process or actor mailbox level. In fact this works whether you are using actors or not. And you can chain Simz runtime services, moving from a local to a global context.
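As a rough illustration of supervision driven by behavior rather than by process exit, the following Python sketch restarts a worker after a run of slow metered calls. Everything in it is hypothetical: the BehaviorSupervisor class, the on_metered_call callback and the latency threshold are stand-ins, not the Simz metering plugin interface.

```python
class BehaviorSupervisor:
    """Restarts a worker when metered behavior drifts, not only when it dies.

    Hypothetical sketch; names and thresholds are assumptions.
    """

    def __init__(self, restart, max_latency_ms, tolerance):
        self.restart = restart              # callback that restarts the worker
        self.max_latency_ms = max_latency_ms
        self.tolerance = tolerance          # consecutive slow calls tolerated
        self.slow_streak = 0

    def on_metered_call(self, probe, latency_ms):
        # Track consecutive calls that exceed the latency budget.
        if latency_ms > self.max_latency_ms:
            self.slow_streak += 1
        else:
            self.slow_streak = 0
        # Act once the streak exceeds what we are willing to tolerate.
        if self.slow_streak > self.tolerance:
            self.slow_streak = 0
            self.restart(probe)


restarts = []  # records which probe triggered each restart
supervisor = BehaviorSupervisor(
    restart=restarts.append, max_latency_ms=100, tolerance=2
)

for latency in [20, 150, 180, 30, 140, 160, 170]:
    supervisor.on_metered_call("Db.query", latency)

print(restarts)  # ['Db.query']
```

A lifecycle-only supervisor would have done nothing here, because the worker never terminated. The restart is triggered purely by what the worker was observed doing.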
In Simz 2.0 we plan to go further by offering a bi-directional link between both runtimes that can relay signals generated within the supervisory runtime over to the worker runtime. Both runtimes would have our Signals technology embedded within them, but the worker runtime would be able to process signals that originate within itself as well as signals that originate within the supervisory runtime via metering plugins. The source of a signal would be completely transparent to the application. Note that signals are not your typical process-level notification primitives. Signals drive self-adaptation and self-regulation within a runtime. They are specific to the application's behavior and its activities. We are also going to make the metering runtime signal-aware, allowing us to remotely influence adaptive control valves without the application needing to use our Signals technology explicitly.
One concern with a supervisor restarting a worker process is that the process starts with a clean slate. A clean slate can be good when the failure that triggered the restart was caused by corrupted state, which is now lost or discarded. But it also means that the worker process is none the wiser about why it failed previously. It might be more useful if the worker could recollect the most recent events and make some internal adaptations to reduce the risk of a recurrence, somewhat like self-learning. We can achieve this by accessing the behavioral metering model in the Simz simulated runtime, which survives the termination of the worker process. We can also use the supervisor to taint the behavior of the initializing worker by way of signals. The signals are enclosed in a conversational signal boundary maintained in the supervisor runtime. You can think of it as a worker downloading an upgrade from within the Matrix.
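The idea of a restarted worker inheriting some memory of its predecessor can be sketched as follows. This is a toy model in Python under stated assumptions: the Supervisor and Worker classes, the "caution" signal name and the concurrency limit are all invented for illustration and do not describe the Signals API.

```python
class Supervisor:
    """Keeps behavioral memory that outlives any single worker (hypothetical)."""

    def __init__(self):
        self.last_in_flight = []

    def record_termination(self, in_flight):
        # Remember what the worker was executing when it went away.
        self.last_in_flight = list(in_flight)

    def signals_for_new_worker(self):
        # One assumed "caution" signal per probe that never completed.
        return [("caution", probe) for probe in self.last_in_flight]


class Worker:
    """Starts with a clean slate, then adapts to signals from the supervisor."""

    def __init__(self, signals):
        self.limits = {}
        for name, probe in signals:
            if name == "caution":
                # Assumed adaptation: serialize the suspect activity.
                self.limits[probe] = 1

    def concurrency_limit(self, probe, default=8):
        return self.limits.get(probe, default)


supervisor = Supervisor()
supervisor.record_termination(["Report.export"])

# The replacement worker is initialized with what the supervisor remembers.
worker = Worker(supervisor.signals_for_new_worker())

print(worker.concurrency_limit("Report.export"))  # 1
print(worker.concurrency_limit("Db.query"))       # 8
```

The worker code never asks where a signal came from. It simply starts more cautiously around the activity that was in progress when its predecessor failed.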
“Of course. The new Sam and I will be back to our programming as soon as I finished rebooting.” – Moon (2009)
Note: We can configure the metered application runtime to write the metering feed to a file instead of a network socket and then replay the simulated behavior offline. That, however, is closer to postmortem (root cause) analysis than to runtime supervision, in which the simulation is able to influence, and even manage the lifecycle of, the application runtime.
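The file-based variant can be illustrated with a small Python sketch. The JSON-lines file format, the field names and the record and replay helpers are assumptions made for this example; the actual metering feed format is not described here.

```python
import json
import os
import tempfile


def record(events, path):
    """Write metering events to a file, one JSON object per line (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def replay(path):
    """Rebuild per-thread stacks offline from a recorded feed."""
    stacks = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            stack = stacks.setdefault(event["thread"], [])
            if event["kind"] == "begin":
                stack.append(event["probe"])
            elif stack and stack[-1] == event["probe"]:
                stack.pop()
    return stacks


events = [
    {"kind": "begin", "thread": "main", "probe": "Job.run"},
    {"kind": "begin", "thread": "main", "probe": "Io.read"},
    {"kind": "end", "thread": "main", "probe": "Io.read"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metering.feed")
    record(events, path)
    stacks = replay(path)

print(stacks)  # {'main': ['Job.run']}
```

The replay recovers the same unfinished call, but only after the fact. Nothing in this path can reach back and influence the process that produced the file.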
Introducing Signals — The Next Big Thing in Application Management
Simz — The Promise of Near Real-Time Application Simulation in the Cloud
How to Execute Software Behavior Faster than (Wall Clock) Time
The Power of Slow in Application Performance Analysis — Record & Playback
Many Parallel Worlds