Introducing Signals – The Next Big Thing in Application Management
For the last 6 months I’ve toured the globe (US, Asia, Australia, Europe) giving talks on performance monitoring (measurement) and management (control) aimed at large-scale, distributed, and highly complex Java applications and services. The primary reason for such talks was to increase awareness of self-adaptive techniques applied to application software as a means to address inherent management complexity, with our intelligent activity metering engine serving as a prime example. Whilst it is important to be able to demonstrate that such an approach is achievable in practice, it is far more useful to create something that enables others to succeed in such endeavors. I believe Signals, currently in beta, is a key technology in enabling the development of self-aware, self-adaptive and self-regulated software.
This working document introduces the reasoning, thinking and concepts behind a technology we call Signals, which we believe has the potential to have a profound impact on the design and development of software, the performance engineering of systems, and the management of distributed, interconnected applications and services. We originally designed Signals to address the sub-microsecond execution performance variation analysis of software in extremely low-latency, high-frequency transaction and messaging environments, but subsequently evolved our design to incorporate support for the development of new software components, libraries and platforms that are self-adaptive.
Adapting to Success (optional)
Signal Driven Development
Continuous Performance Management
In 2007, when we started to get enquiries from financial trading platform developers about the usage (and overhead) of our distributed contextual path tracing solution, I started looking at ways to measure our own code for the purpose of tuning for such environments. Unfortunately at that time, and still very much the case today, the typical code profiler was pretty much useless once you had removed all the low hanging fruit (unless of course you had a memory leak and a heap dump to analyze, which fortunately was never a problem for us). The problem we faced was that the overhead of commercial code profilers was much higher than our own, maybe not at the unit cost level at that time but most certainly when the degree of instrumentation and measurement was factored in across all methods in our code base, many of which did not need to be profiled. So I designed a solution, initially for our own internal use, that would adaptively manage the overhead of instrumentation, measurement and collection by employing multiple strategies, chained together, that factored in the cost, value and relevance of what was measured – a hotspot for performance measurement. It was so successful it superseded the measurement solution it was built to measure.
Since 2008 we have engineered away as much overhead as is possible within the design of the metering engine, using various low level heuristics along with adaptive selection of hot execution paths. We have achieved impressive benchmark results that show our software activity metering technology to have 100x-1000x less overhead than all other competing solutions – both code profilers and performance monitoring solutions. But success breeds failure of a sort, and so once again we started to get enquiries that would push our technology far beyond what we could possibly hope to achieve – nanosecond-level profiling. How could we claim to have a low overhead in such environments when the cost (overhead) of two clock counter accesses was above 100 nanoseconds? When we investigated further, what we found was that the actual transaction times were between 25 and 250 microseconds, so metering at the entry point level or using budgeting could deliver a workable solution. But such a solution would only address measurement (visibility); it would not necessarily help identify and resolve the underlying causes of the performance (wall clock time) variation observed at such points, the kind of variation that could very easily be attributed to slight differences in the execution of a branch or loop statement – something that was not realistically measurable using clock-based timing.
One way to capture such micro-level behavioral differences would be to use a hybrid form of sampling and event-based measurement. We could wrap the clock.time meter with a sampling mechanism. Which we did, adding a sample property to each built-in and custom meter. We could move the cost of the clock.time meter reading out to a separate thread that executed a fast spin loop and reported in ticks (counts). Which we did, adding a built-in spin.count meter. We could count the execution of code rather than time it. Which we did, offering a count meter. We could turn our event-based metering engine into a high-performance probe stack sampling engine – near-zero call latency, no allocation, and no increase in garbage collection pressure. Which we did, developing the swatch metering extension. But still being mostly method-level instrumentation based, we could not accurately observe, identify, label and count the branches (and loops) in the code within the method itself, other than by inferring them from differences in the observed outbound calls to other methods across different executions of the (caller) method – which to some extent we did already address via metering change sets. Whilst it is possible in theory to map hundreds of such counters to individual meters, it is not practical for all intents and purposes, and it is questionable whether anyone would derive value from having the number of times a branch was executed for each and every caller on the metered call stack.
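The spin-loop technique above can be sketched in plain Java. This is an illustrative stand-in I wrote for this article, not the actual metering engine: a dedicated thread spins and publishes a tick counter, so measured code pays only two volatile reads rather than two expensive clock accesses.

```java
// Illustrative sketch (all names hypothetical): a spinner thread publishes
// ticks so the measured thread never touches the clock itself.
final class SpinTicker implements AutoCloseable {
    private volatile long ticks;          // single writer, many readers
    private volatile boolean running = true;
    private final Thread spinner;

    SpinTicker() {
        spinner = new Thread(() -> {
            while (running) {
                ticks++; // fast spin loop reporting in ticks (counts)
            }
        });
        spinner.setDaemon(true);
        spinner.start();
    }

    // Near-zero cost for the measured thread: a volatile read.
    long read() { return ticks; }

    @Override public void close() { running = false; }
}
```

Elapsed work is then measured as a delta of two `read()` calls, with the tick-to-time ratio calibrated separately.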
Note: I would never consider traditional Java call stack sampling a solution for anything beyond a single-threaded desktop application, due to its huge call latency in an environment with many threads executing reasonably deep call stacks, and the even bigger issue of more frequent and prolonged GC cycles caused by such collections.
While exploring various extremely low level optimizations that could be applied to the metering engine, it occurred to me that we also needed a cheap way to count particular execution behaviors deep within the metering measurement code, which would be accessible to our instrumentation up in the metered application code base for the purpose of adaptation. And so I started, yet again, to design a measurement engine to measure the performance drivers of another measurement engine. Here are the requirements I started the design with:
- Support sub-microsecond execution analysis
- Support scoping (hiding) and propagation of named event counts called signals
- Support complex event processing at the boundary of scopes such as aggregation, composition, transformation,…
I had two overriding goals for the design:
- Online/offline detection of underlying root causes (signal sets) of observed performance variance (time metering)
- Development of self regulated and adaptive component libraries and systems
During this time I attended the JVM Language Summit, Santa Clara, 2011, with the aim of convincing the Oracle (Sun) JVM team that we needed intrinsic support for efficient thread (and stack) scoped named numeric accumulators. Unfortunately at that time I was so deep in getting the underlying implementation details in place that I could not adequately describe why signals were needed and the enormous potential impact they could have on the JVM and its libraries, but with time and lots of meet-ups at Java User Groups, I have come up with useful analogies and examples. Hopefully what follows will explain in much greater detail what I was unable to convey in my talks within the time frame given.
Today when we are tasked with a job to be performed, there is an initial exchange in which job parameters and constraints are defined and set, which then drive planning of the work items needed to complete the task. Then there is a final exchange in which the product is handed over. Typically around the same time that these exchanges occur there are implicit and explicit signals communicated by both parties which can be of a significant importance in the execution of the planned work items. These signals can help explain the cause of observed deviations from expected performance and in addition can be used to drive changes (adaptations) in the future assignment and planning of similar jobs. Here are examples of some explicit signals communicated upon completion of a task beyond its expected (or planned) time:
- “some of the parts were not in stock and so we had to create them as we went along”
- “we had problems sourcing the material from the suppliers you gave us”
- “we underestimated the work effort involved, which we had largely based on your past orders”
- “we have completed the job but we did incur some additional charges which we are passing on to you”
- “we had some capacity management issues which we could not have foreseen when we took on the job”
- “we did not make a job run window and then had to contend with other jobs some of which had higher priority”
- “we are still learning how best to order and optimize some of the work items particularly to your job submissions”
OK, some of the above might not be communicated at all, at least not directly, but more than likely they would be privately recorded within the organization and passed up various levels, possibly in a slightly different form at each level. Those that were communicated would be passed along to various interested parties, again sometimes in a different form and manner.
Signals can also occur during the submission and acceptance of a job.
- “in a hurry here and would appreciate it if this could be done as soon as possible”
- “last time we experienced a delay with the reason being…I trust that will not be a factor in finishing the work this time around”
- “we are using the same materials as before which I know caused some complications but hopefully we’ve learned from that”
- “you are new around here…have you looked at our file and past orders, and planned accordingly”
Most of the APIs we design and call into today don’t allow such exchanges to occur, or for signals to be communicated and observed, then propagated, profiled and persisted. The interaction is nearly always stateless even when it is stateful, as in the case of bindings between caller and callee, user and session, service and resource, and so on. The state that is maintained across exchanges for such bindings is rarely related to the signals, more so to the arguments and return values. With little or no memory of past behavior other than state side effects, we limit the improvement in performance and quality that we observe (and expect) in the real world over the course of an association and its many interactions. I believe this to be a fundamental failing of software. Software must become more adaptable (our greatest strength) and signals can drive this for the most part.
So what exactly is a software signal? A signal has a namespace identifier much like a package or class. It also has a path (or track), consisting of a sequence of boundary names, which reflects its origin and propagation upwards through boundaries. For each signal within a named boundary (scope), the occurrence and strength of the signal are tracked and tallied (summed) within the property values count and total respectively. The count is incremented by one every time a signal is emitted. When code emits a signal it can pass an optional numeric argument which we refer to as the strength, with each signal having its own arbitrary scale. The strength is added to the total for the signal within the boundary. An emit should not result in any object being created or control being passed elsewhere in the software. It simply updates both the numeric values behind each property as if they were variables local to the function. Think of signals as silent (or passive) exceptions. Here the variables are in fact local to the boundary, which encloses one or more callers and callees. When execution exits the scope of the boundary, rules are applied, which in turn can fire actions that propagate one or more signals to the enclosing boundary, if any. New signals can also be composed from existing signals within the boundary and then propagated upwards. Signals can be renamed and their strengths transformed (dampened) before propagation. Even at the boundary, signals can be emitted within the current scope, leading to additional rules being evaluated and actions fired, though naturally care must be taken to avoid unwanted recursive cycles. The underlying mechanism is adaptive, filtering out signal noise at the earliest possible opportunity in the evaluation and cutting short signal propagation upwards through enclosing boundaries.
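The count/total semantics above can be captured in a few lines of plain Java. This is a minimal sketch of the described behavior, not the actual Signals Open API; the class and method names below are my own stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the described emit semantics: each signal name
// maps to a pair of tallies local to the current (single) boundary.
final class Boundary {
    private final Map<String, long[]> tallies = new HashMap<>();

    // emit bumps count by one and adds the strength to total; control never
    // leaves this method, and nothing is allocated after a name's first emit.
    void emit(String signal, long strength) {
        long[] t = tallies.computeIfAbsent(signal, k -> new long[2]);
        t[0] += 1;        // count: occurrences
        t[1] += strength; // total: summed strength
    }

    long count(String signal) {
        long[] t = tallies.get(signal);
        return t == null ? 0 : t[0];
    }

    long total(String signal) {
        long[] t = tallies.get(signal);
        return t == null ? 0 : t[1];
    }
}
```

The real runtime additionally scopes these tallies per thread and per boundary nesting, and applies propagation rules on boundary exit.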
To explain signals in the context of software caller-to-callee interactions, let’s look at one of the most prevalent data structures used in software – the HashMap. Other than the construction of a HashMap, the two most common API calls are HashMap.put(Object, Object) and HashMap.get(Object). Both of these methods can exhibit performance variance, but generally the execution is so fast that it would be impractical to time every single method invocation, and even then it is not certain whether the measurement would be accurate and/or reliable (precise). Most developers would be aware of the underlying implementation of each method and would be able to identify the drivers of possible variance in execution. Leaving aside the performance of the hashCode() method called on the first Object parameter, which is not under the control of the HashMap library developer, the performance of the HashMap.get(Object) method is largely dictated by the number of entries chained at the index point into the internal table that are navigated over and compared with. For HashMap.put(Object, Object) it is likewise, with the additional possibility of an expensive table resize for a key that is not already present where the new size is over the capacity threshold set for the HashMap.
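For readers not familiar with the execution logic, here is a simplified rendering of the JDK 7 era chained lookup – paraphrased for this article, not the exact shipped Oracle JDK source – showing why get() cost is dominated by the number of entries chained at the computed table index.

```java
// Simplified, paraphrased sketch of a JDK 7 style HashMap lookup:
// hash, index into a power-of-two table, then walk the collision chain.
final class JdkStyleLookup {
    static final class Entry {
        final int hash; final Object key; Object value; Entry next;
        Entry(int hash, Object key, Object value, Entry next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    static Object chainedGet(Entry[] table, Object key) {
        int hash = (key == null) ? 0 : key.hashCode();
        int index = hash & (table.length - 1); // table length is a power of two
        for (Entry e = table[index]; e != null; e = e.next) {
            if (e.hash == hash && (e.key == key || (key != null && key.equals(e.key)))) {
                return e.value; // found after walking zero or more chained entries
            }
        }
        return null;
    }
}
```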
Here are some suggestions for signals that could be emitted by the HashMap implementation. To add a table.scan signal to the HashMap.getEntry(Object) method we would need to track the number of entries compared, i, and then just before returning emit the signal with i as its strength – a single statement, assuming language support, or a single call using the Signals Open API. For a table.resize signal the strength emitted would be the length of the new table.
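A hypothetical sketch of what such instrumentation could look like. The Signals.emit(...) call below is a stand-in I wrote to make the example runnable, not the actual Open API; the chain walk mirrors a getEntry-style scan, emitting the number of entries compared as the strength of table.scan.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the Signals Open API (hypothetical): tallies count and total
// per signal name so the instrumented scan below can run.
final class Signals {
    static final Map<String, long[]> TALLIES = new HashMap<>();

    static void emit(String name, long strength) {
        long[] t = TALLIES.computeIfAbsent(name, k -> new long[2]);
        t[0]++;           // count: one more emit
        t[1] += strength; // total: summed strength
    }
}

final class ChainScan {
    static final class Node {
        final Object key; final Object value; final Node next;
        Node(Object key, Object value, Node next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    // getEntry-style walk over a collision chain; i counts entries compared
    // and is emitted as the table.scan strength just before returning.
    static Object get(Node head, Object key) {
        int i = 0;
        for (Node e = head; e != null; e = e.next) {
            i++;
            if (e.key.equals(key)) {
                Signals.emit("table.scan", i);
                return e.value;
            }
        }
        Signals.emit("table.scan", i);
        return null;
    }
}
```

A table.resize signal would be emitted the same way at the point of resize, with the new table length as the strength.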
Now with signal instrumentation built into HashMap, and possibly many other utility collection classes within the JDK, we can very easily collect a complete signal set much further back up the caller chain every time a deviation from expected performance (via metering) is detected: by taking a Snapshot of the cumulative signaling within the thread-specific Context prior to commencement of work, then following completion comparing the Context again, resulting in the generation of a ChangeSet holding those signals that have changed, along with their strengths (total) and occurrences (count), since the Snapshot was taken.
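The Snapshot/ChangeSet cycle can be illustrated with a small, self-contained sketch. The names and shapes here are my own stand-ins for the described mechanism, not the actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a thread context accumulates signal tallies; snapshot
// before a unit of work, diff after, and only the signals that moved
// (delta count, delta total) appear in the change set.
final class SignalContext {
    private final Map<String, long[]> tallies = new HashMap<>();

    void emit(String name, long strength) {
        long[] t = tallies.computeIfAbsent(name, k -> new long[2]);
        t[0]++; t[1] += strength;
    }

    // Deep copy of the current tallies, taken before work commences.
    Map<String, long[]> snapshot() {
        Map<String, long[]> copy = new HashMap<>();
        tallies.forEach((k, v) -> copy.put(k, v.clone()));
        return copy;
    }

    // Signals whose count changed since the snapshot, with delta tallies.
    Map<String, long[]> changeSet(Map<String, long[]> before) {
        Map<String, long[]> changes = new HashMap<>();
        tallies.forEach((k, v) -> {
            long[] prev = before.getOrDefault(k, new long[2]);
            if (v[0] != prev[0]) {
                changes.put(k, new long[] { v[0] - prev[0], v[1] - prev[1] });
            }
        });
        return changes;
    }
}
```

A metering engine would take the snapshot on entry to a metered probe and compute the change set only when a timing deviation is observed, keeping the common path cheap.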
With the ability to generate signal sets on every observed deviation, we now have the required data to feed directly into an analytical engine that applies (machine) learning algorithms to the data to further eliminate noise from the signal set. What is extremely important to note here is that the callers need not necessarily understand the semantics of the signals. With each signal having a unique namespace within a path, and with a path representing the runtime boundary propagation hierarchy, we can not only identify the component and the particular drivers of variation, we can also track the flow across boundaries, which would typically represent application layers and services, each having multiple enclosed caller-to-callee stack frames.
Note: Signals can be introduced into any existing class implementation without any impact on the class interface. We don’t need to retrofit existing classes or interfaces such as returning a Pair struct type holding the result and a set of signals. Signals are silent and local to the thread and its stack of boundaries. The introduction of signals should be completely transparent to non-adaptive software.
Unlike the variable block scoping rules in Java, each signal boundary is able to observe (and store) the signals propagated up to it by the boundaries it encloses, but it can only see those signals in its own enclosing (parent) boundary that it has itself propagated, and then only the cumulative totals and counts for those propagations. This is incredibly important because it reduces the possibility of signal interference across different call paths with different behavioral characteristics.
Note: Boundaries are what make signals so powerful as they net (capture) and retain the signaling across multiple enclosed calls (interactions). They bring a distinct hierarchy (low to high in terms of abstraction) to signal processing and propagation much like the memory structure that underlies cortical learning algorithms.
Accessing a signal’s cumulative values for both count and total within the current thread context boundary can be done by first looking up the signal.
Note: Boundaries allow enclosed method calls to maintain conversational state in the scope of a direct or indirect caller method by way of a signal’s count and total cumulative values. The count and total for a named signal will differ within each enclosing boundary until the outermost boundary scope has exited and all propagation has occurred.
A useful way of thinking about boundaries is as conversation threads (in the typical online forum sense) nested within another conversation thread, which itself can be nested in yet another conversation thread. You can see the conversations that have forked from your own conversation, but you cannot see the other conversations forked from the same conversation thread you were forked from. Each conversation thread maintains the memory of its nested conversations, but each memory (signal history) is isolated. It is the responsibility of the forking conversation to memorize the essence of each conversation forked from it and to reconstruct its history every time the same conversation starts. This is all automatically handled by the signals runtime without any work on the part of the developer, other than specifying suitable boundaries and then, most importantly, creating such boundaries with names that distinctively classify (or delimit) them.
Note: Boundaries demarcate the scope of a conversation – a kind of transaction for signals.
There are two types of signal boundary instances – an anonymous boundary and a named boundary. An anonymous boundary can be used by a developer to restrict the propagation of signals emitted within a method and the methods it (directly or indirectly) calls. It can also be used to hide implementation details, or to net (capture) all signals and then translate them into something more meaningful to callers, similar to a try-catch-throw pattern. In doing so it also limits the duration (time scope) of the boundaries (conversations) below it. A named boundary has many advantages over an anonymous boundary. Firstly, the signaling system can use the names of boundaries, and the unique paths they create, as a way to regionalize the memory it uses to adaptively reduce signal noise. Secondly, complex event processing (CEP) style signaling rules can be externalized in configuration and keyed on such names and paths. Thirdly, it adds high-level (work)flow insight. Fourthly, it allows for adaptation via the automatic downward propagation of signal signatures. Finally, boundaries can be named based on context rather than the code. One example of this would be the execution of JRuby code running a Ruby script. Here a boundary would represent a statement in the Ruby script, with the boundary enclosing the underlying implementation methods in Java. Signals is a perfect solution for many JVM languages that need to expose fine-grained observability further up the caller stack, especially in application server container environments. It also helps immensely with understanding the efficiency of any language-to-bytecode mapping, which is a challenge today considering the clock resolution for such operations in Java 7 and onwards.
Boundaries, whilst mapping naturally to (select) execution stack frames, can also demarcate integration with particular instances of a data structure or algorithm. For example, let’s say we have a class that has two HashMap instance fields and that we wish to separate the signals emitted during the interaction with each instance in the execution of a method. This can be done by using a different boundary name around the interaction with each instance; to delineate the two HashMap variables in a method we would simply create two differently named boundaries.
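A sketch of how named boundaries could separate the tallies for the two fields. The boundary stack and the enter/exit/emit names below are illustrative stand-ins, not the actual API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: signals emitted while a named boundary is open are
// tallied under that boundary's name, so interactions with two different
// HashMap fields can be told apart.
final class NamedBoundaries {
    static final Deque<String> STACK = new ArrayDeque<>();
    static final Map<String, Map<String, Long>> COUNTS = new HashMap<>();

    static void enter(String name) { STACK.push(name); }
    static void exit() { STACK.pop(); }

    // Tally the signal under whichever boundary is currently open.
    static void emit(String signal) {
        String boundary = STACK.peek();
        COUNTS.computeIfAbsent(boundary, b -> new HashMap<>())
              .merge(signal, 1L, Long::sum);
    }

    static long count(String boundary, String signal) {
        return COUNTS.getOrDefault(boundary, Map.of()).getOrDefault(signal, 0L);
    }
}
```

With boundaries named, say, "users.cache" and "orders.cache" around the two field interactions, a table.scan emitted by the shared HashMap code lands in the right bucket without the HashMap knowing anything about its callers.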
Note: The above does highlight an important modeling decision, in that events can be modeled as boundaries or signals. This is very similar to modeling with metering, in which an event can be a Probe (activity) or a Meter (resource). This is not by accident but a reflection of the underlying cause-and-effect event chain.
In a future follow-up I will explore signal boundaries in much greater depth, covering how important they are in filtering out noise, hiding internal self-regulatory mechanisms, and strengthening signals, leading to increased information quality (relevance, comprehension,…).
Signals allows for rules to be fired and listeners to be notified at each stage (such as AFTER_EXIT) in the life cycle of a named boundary. Rules can conditionally match on the occurrence, strength and sequencing of signals emitted within the scope of a boundary, including those of nested boundaries. Such rules are defined externally in configuration and keyed (mapped) on boundary namespaces. A matching rule can restrict the propagation of one or more signals. It can also create and emit a new higher-level signal within the scope of the boundary and/or its enclosing (parent) boundary. The signals runtime will automatically maintain this traceability.
Note: Qualitative signals can be derived from quantitative signals.
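One way to picture such a rule firing on boundary exit: the tallies netted inside the boundary are inspected, a qualitative signal is derived from the quantitative ones, and only that derived signal is propagated to the parent. The rule logic and names below are hypothetical, shown in code rather than the external configuration the runtime would actually use.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative exit rule: derive a higher-level signal for the enclosing
// boundary while keeping the raw low-level tallies private.
final class ExitRules {
    // inner = signal counts netted inside the exiting boundary;
    // the returned map is what gets propagated upwards.
    static Map<String, Long> onExit(Map<String, Long> inner) {
        Map<String, Long> propagated = new HashMap<>();
        long resizes = inner.getOrDefault("table.resize", 0L);
        if (resizes > 0) {
            // Qualitative signal derived from a quantitative one.
            propagated.put("capacity.pressure", resizes);
        }
        // Noise reduction: table.scan stays private to this boundary.
        return propagated;
    }
}
```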
Listeners, externally configured much like rules, can also be attached to a named boundary for more elaborate rule matching as well as integration with other measurement and monitoring solutions and systems.
Note: Boundaries can serve as a natural point in the execution to apply adaptive control and quality of service (QoS), before a boundary is entered and after it is exited, with both persisted and present signal sets influencing such intervention.
Most library developers would readily accept the need for greater instrumentation via something like signals, though I expect it would still take a back seat to functional requirements, and there would be a reluctance to change existing implementations. To alter that viewpoint, greater value needs to be obtained from the signals emitted and collected. Such value can be obtained through the dynamic runtime adaptation of a library based on its own signals. Signals can be used in conjunction with adaptive control valves to improve the search for the optimal and sustainable workload for a library component. Signals can be emitted privately within the library and then reflected upon at the ingress points into the library to drive its own internal self-regulation mechanisms, in a far more natural and comprehensive manner than is typically done in code today. Signals can be interpreted by callers and used to alter the future creation and usage of a particular data structure and algorithm.
Far more important is that signals emitted beyond the library boundaries and passed up the caller-to-callee chain can be sent back down after signal noise reduction, then used to alter the execution flow within the library, with such behavioral change reflecting more of the caller context than the generality of the library. A problem for dynamic runtime adaptation is that the class (or object instance) can’t accurately predict usage patterns or capacity needs within the current execution context, because it can’t determine the most appropriate caller or caller-to-callee sequence on which to base such predictions. It simply can’t see or identify who is interacting with it. There is no readily available means of recollecting or reconstructing conversations, other than what is maintained in its own internal state. Here is where signals can help, being the digest (signature, profile) of such interactions, collected by one or more callers, then made available at the boundary of subsequent interactions. A boundary created by a caller that plans to make multiple calls into a library within the current execution context (and boundary) can also serve as a means to store memory signals (and eventually signal sequencing) across the multiple enclosed interactions – for both stateless and stateful components.
Note: Signal propagation, both upwards and downwards, is automatically managed by the underlying signals runtime.
Signal sets in the context of adaptation act as classifiers of caller(s) and context. Signal sets are derived, profiled, and refined automatically by the underlying signaling engine over the course of multiple boundary path traversals. Some of the signals within such profiled sets will serve as operational hints or directives in the servicing of future interactions. Other signals will simply be used to uniquely identify or mark a caller. The beauty in all of this is that the direct or indirect caller need not be aware of how marking is done and which signals drive adaptation (or control). The meaning of signals is irrelevant to the caller. The only responsibility of boundary callers is that the signals be aggregated and persisted across deconstructions of boundaries and then materialized on reconstruction. Think of it as an incredibly powerful and flexible feedback and learning loop for software.
Signals also addresses a thorny issue in the design of feedback loops – how should control be exposed? In traditional control systems this is typically done via one or more set points that regulate flows and processes within a plant. A sensor is continuously measured, its measurements are interpreted by a controller and then translated into actions which interact with the process (plant) being controlled. With signals, the operational knowledge does not necessarily need to be spread across system components. The process (plant) can generate the signals that drive its own subsequent execution behavior, with the boundary serving as the context and memory for such adaptations and regulations. When a new boundary is created, the process can start the observe, learn and adapt cycle all over again, but tailored to this new context.
Signals can be extremely useful for application/library developers, allowing them to collect much more relevant data (at runtime) needed to guide the development of coding heuristics – optimizing for the common case. Such measurement and collection enables both offline and online software adaptation, as well as fine tuning of initial set points and thresholds in control mechanisms. It removes a good deal of guesswork for the developer who is forced to commit to default parameter values in the construction of data structures (thread pools,…), though poor judgement can still be compensated for by online adaptive techniques, with some initial cost during the learning phase. Signals are a modern form of usage tracking that goes far beyond recording what is called at the surface of an API. Signals can create a powerful design, development and deployment feedback loop.
Most initiatives involving continuous performance monitoring of applications and services fail to deliver the benefits touted by supporters, especially in non-deterministic environments such as the JVM, due to the usage of wall clock time as the means to detect deviation across builds and releases. Signals, being behavioral counters, are not subject to the same level of disturbance that contention, suspension and scheduling inflict on timed execution, though signals can be used to detect such disturbances.
Finally we believe that the above approach has much wider applications such as in monitoring and management of cloud services.
It is our strong belief that application management (monitoring + adaptation + control) should be an integral aspect of all application software. The days of an externalized, central application performance monitoring solution that is unable to influence the software it monitors in real time without human intervention are numbered. Open APIs such as our very own Metrics, Probes, and Signals can dramatically enhance the quality and reliability of software, and so we plan on making the Metrics, Probes and Signals APIs, as distinct from their underlying implementations, available under an Open Source license at the very beginning of 2013, when we deliver the first release of JXInsight/Signals. We look forward to seeing every new application and service ship with resource metering, metric monitoring and signaling instrumentation hooks built directly in, as a means to profile, police, protect, prioritize, predict and provision execution behavior and resource consumption.
If you are interested in learning more, possibly assisting us with extended academic research, or bringing Signals to other languages and runtimes, please email me at email@example.com; alternatively you can connect on LinkedIn.
As further examples, the java.util.concurrent.locks.AbstractQueuedSynchronizer class could be enhanced to emit signals recording whether tryAcquire is successful or not; when it is not, a more expensive call, doAcquireNanos(), is executed. Signals could also be added to the doAcquireNanos() method to track the number of loops performed, especially as the relatively expensive System.nanoTime() method is called for each iteration.
The java.lang.AbstractStringBuilder class, which is the super class of both StringBuilder and StringBuffer, could be enhanced with signals that track (within the scope of a thread boundary) the number of times expansion occurs as well as the size of such expansions. Trust me, you will be amazed how much resizing goes on in the processing of web application requests.
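You can get a feel for this hidden resizing today by watching the capacity of a growing StringBuilder from the outside – exactly the kind of event an expansion signal inside AbstractStringBuilder would tally. The helper below is my own illustration, not instrumentation of the JDK itself.

```java
// Count capacity jumps of a StringBuilder while appending: each change in
// capacity() means the backing char array was reallocated and copied.
final class ExpansionCount {
    static int countExpansions(int appends, String chunk) {
        StringBuilder sb = new StringBuilder(); // small default capacity
        int expansions = 0;
        int capacity = sb.capacity();
        for (int i = 0; i < appends; i++) {
            sb.append(chunk);
            if (sb.capacity() != capacity) { // backing array was resized
                expansions++;
                capacity = sb.capacity();
            }
        }
        return expansions;
    }
}
```

Building a 1,000-character string ten characters at a time typically triggers several such reallocations, each one a hidden copy of the entire buffer.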