From Anomaly Detection to Root Cause Analysis via Self Observation
“Anomaly detection, also referred to as outlier detection refers to detecting patterns in a given data set that do not conform to an established normal behavior. The patterns thus detected are called anomalies and often translate to critical and actionable information in several application domains. Anomalies are also referred to as outliers, change, deviation, surprise, aberrant, peculiarity, intrusion, etc.” – Wikipedia
There are two general techniques for performing anomaly detection in software systems. The first is based on time series analysis of sampled measures (metrics), which is generally done offline (or online but sufficiently in the past). The second is event based, comparing one or more event-specific measurements (clock, cpu,…) with predefined or dynamic thresholds, and is generally performed at the point of the event's occurrence (in time and space).
In the context of event based analysis, a number of approaches have been used that allow moving on from detection through to root cause analysis. One approach, used by solutions that largely rely on call stack sampling to measure code performance, is to have each thread register itself for observation with a supervisory thread at the beginning of a request. The supervisory thread then checks on the progress of the registered threads every so often (in milliseconds). When it detects that a thread has passed the time threshold (which may be predefined or dynamic) for a particular request/operation, it begins sampling the call stack of the request thread at regular fixed intervals (in milliseconds) until the thread eventually completes and unregisters itself from further observation until the next request.
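To make the mechanics concrete, here is a minimal sketch of such a supervisor written against plain JDK classes. The names SamplingSupervisor, observe and busyFor are invented for illustration and are not taken from any of the products discussed; the threshold here is fixed, and a real agent would do far more work to bound overhead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical supervisor: stacks are sampled only once a request is over the threshold. */
final class SamplingSupervisor implements AutoCloseable {
    /** One registered request: the thread running it, its start time and any samples taken. */
    private static final class Observation {
        final Thread thread = Thread.currentThread();
        final long startNanos = System.nanoTime();
        final List<StackTraceElement[]> samples = new CopyOnWriteArrayList<>();
    }

    private final Set<Observation> active = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler;
    private final long thresholdNanos;

    SamplingSupervisor(long thresholdMillis, long intervalMillis) {
        thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "sampling-supervisor");
            t.setDaemon(true); // never keep the JVM alive just for monitoring
            return t;
        });
        scheduler.scheduleAtFixedRate(this::check, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs one request on the calling thread and returns the stack samples taken, if any. */
    List<StackTraceElement[]> observe(Runnable request) {
        Observation o = new Observation();
        active.add(o);                 // register for observation
        try {
            request.run();
        } finally {
            active.remove(o);          // unregister until the next request
        }
        return new ArrayList<>(o.samples); // snapshot; a late sample may still land in o.samples
    }

    /** Supervisor tick: sample only the threads whose request has run past the threshold. */
    private void check() {
        long now = System.nanoTime();
        for (Observation o : active) {
            if (now - o.startNanos > thresholdNanos) {
                o.samples.add(o.thread.getStackTrace());
            }
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    /** Stand-in for slow request work: spins for roughly the given number of milliseconds. */
    static void busyFor(long millis) {
        long end = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        while (System.nanoTime() < end) { /* spin */ }
    }
}
```

A request thread would wrap its work as supervisor.observe(() -> handle(request)), where handle is whatever the application does. Note that nothing is recorded for the time before the threshold is crossed.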
The primary advantage of this approach is that no measurement is performed until the threshold has been exceeded, and when it does measure it is only call stack sample based. The primary disadvantage of this approach stems from the very same fact: no measurement is performed until the threshold has been exceeded. Yes, you are reading that correctly. Why is it that what is good is at the same time also bad? Because it is simply a trade-off between possibly lower measurement overhead and many important information quality metrics such as accuracy, precision, resolution, coverage, composition, comprehension and completeness.
With call stack sampling starting only from the point at which the time threshold is exceeded, information quality is lost in terms of:
- Accuracy: It is not possible to accurately determine the time spent by methods that are sampled. Even if measurement were event based, it would still not be possible for those methods that started before the threshold and completed sometime after it.
- Precision: By its very nature call stack sampling is not precise, especially in the JVM without uniquely identifiable stack frames.
- Resolution: The default time interval for most call stack samplers is 10 ms, which by today's standards is pretty coarse. There are a number of reasons for this, including the cost of stack collection for a high number of threads, the ever-increasing depth of such stacks and the resulting impact on garbage collection. Many of our customers execute transactions (trades) in under 1 ms, so for them this approach has no value whatsoever.
- Coverage, Composition, Comprehension and Completeness: These are all severely impacted because no detailed behavioral evidence (code frequency & timing) is gathered before the threshold point, which is more than likely where the actual problem is. Data collected after the threshold could very well report only on the normal execution timing of completion and cleanup code. This is further compounded by the fact that thresholds are typically set (by humans as well as machine algorithms) high enough to avoid too many alarms. On top of this, there is no understanding of what constitutes normal behavior within the processing itself, so there is effectively nothing to compare with other than all the other requests that also exceeded the threshold. If requests don’t exceed the threshold by a huge amount then you are out of luck…if they do then you are out of excuses.
Note: This approach first appeared on the Java performance scene in the now defunct Glassbox project and has more recently been used by AppDynamics and NewRelic. It’s typically used by vendors with very little in the way of intelligent adaptive measurement and by those with high overhead due to inefficient measurement and collection code.
Self Observation – The Better Approach
A much better approach is to have the threads themselves (continuously) measure and observe their own execution behavior (code) and performance (clock, cpu,…) and then store the resulting collection of aggregated measurements only if a threshold has been exceeded at completion of the request processing. Combining this with some degree of intelligence in measurement gives the best of both worlds, low overhead and high information value, whether it is monitoring requests that take seconds, milliseconds, or microseconds.
Before a thread begins processing a request it creates a SavePoint (checkpoint) referring to a particular point in time in terms of the metering (frequency & timing) state of the thread's Context. Then on completion of the request it generates a ChangeSet holding the Changes in the metering state since the SavePoint by way of comparison with the thread's metering Context. Note that the compare function need not be called by the thread unless the threshold is exceeded.
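The save point and change set idea can be sketched in a few lines of plain Java. To be clear, this is not the Probes Open API: MeteringContext, SavePoint, ChangeSet and SelfObservation below are simplified stand-ins written for this illustration, and copying the whole meter map at the save point is a shortcut that a production implementation would presumably avoid.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical per-thread metering state: {call count, total nanos} per metered name. */
final class MeteringContext {
    final Map<String, long[]> meters = new HashMap<>(); // touched only by the owning thread

    /** Called by instrumentation each time a metered method completes. */
    void record(String name, long nanos) {
        long[] m = meters.computeIfAbsent(name, k -> new long[2]);
        m[0]++;
        m[1] += nanos;
    }

    /** Captures the metering state as it stands right now. */
    SavePoint savePoint() {
        Map<String, long[]> copy = new HashMap<>();
        meters.forEach((name, m) -> copy.put(name, m.clone()));
        return new SavePoint(this, copy);
    }
}

final class SavePoint {
    private final MeteringContext context;
    private final Map<String, long[]> baseline;

    SavePoint(MeteringContext context, Map<String, long[]> baseline) {
        this.context = context;
        this.baseline = baseline;
    }

    /** Diffs the context's current state against this save point. */
    ChangeSet compare() {
        Map<String, long[]> changes = new HashMap<>();
        context.meters.forEach((name, now) -> {
            long[] then = baseline.getOrDefault(name, new long[2]);
            if (now[0] != then[0]) {
                changes.put(name, new long[] { now[0] - then[0], now[1] - then[1] });
            }
        });
        return new ChangeSet(changes);
    }
}

final class ChangeSet {
    private final Map<String, long[]> changes;

    ChangeSet(Map<String, long[]> changes) { this.changes = changes; }

    long count(String name)      { return changes.containsKey(name) ? changes.get(name)[0] : 0; }
    long totalNanos(String name) { return changes.containsKey(name) ? changes.get(name)[1] : 0; }
}

final class SelfObservation {
    /** Runs a request; the comparison is only paid for when the threshold is exceeded. */
    static Optional<ChangeSet> process(MeteringContext ctx, long thresholdNanos, Runnable request) {
        SavePoint savePoint = ctx.savePoint();
        long start = System.nanoTime();
        request.run();
        long elapsed = System.nanoTime() - start;
        return elapsed > thresholdNanos ? Optional.of(savePoint.compare()) : Optional.empty();
    }
}
```

Because measurement runs from the very start of the request, a change set that brackets a request answers questions such as how many times a given method ran during it: in this sketch that is simply count("com.acme.App.leaf") on the returned ChangeSet.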
Most of the problems with the previous approach are solved. The information is complete, allowing comparative analysis with normal behavior patterns, which can be defined by aggregating, binning and classifying such collections. What is even more important is that the collection set can be trimmed at the source or, even better, a signal raised and all data discarded immediately. We seriously have to start (re)considering whether we, and not the machine, are the primary and most appropriate consumers of management data in this new era of computing in the cloud.
Here is how this is achieved using our Probes Open API, though it's not necessary, as our built-in transaction metering extension will perform the same function under the hood of the metering instrumentation inserted at runtime into classes.
With such self observation capabilities it is very easy to ask self-reflecting questions of a thread's code execution behavior between two points in time. Here is how to determine the number of times the method com.acme.App.leaf was called directly or indirectly by the com.acme.App.call method using the ChangeSet that was generated following completion of the
Whilst this is a unique and novel approach, it is still relatively simple to understand and incredibly powerful in its application to the many management tasks that applications, platforms and runtimes will be required to perform over the coming years as the influence and usage of the cloud expands. Here is how the latency for each package, class or method executed can be determined.
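As a rough illustration of that roll-up (and not the Probes Open API itself), the sketch below assumes the changes have already been flattened into a map from fully qualified method name to inherent (self) time in nanoseconds. Self time is assumed so that summing across methods does not double count nested calls. LatencyRollup and its Level enum are names made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical roll-up of per-method self time to class or package level. */
final class LatencyRollup {
    enum Level { METHOD, CLASS, PACKAGE }

    /** Strips trailing name segments: none for METHOD, one for CLASS, two for PACKAGE. */
    static String key(String method, Level level) {
        String name = method;
        for (int i = 0; i < level.ordinal(); i++) {
            int dot = name.lastIndexOf('.');
            if (dot < 0) break; // nothing left to strip (default package, malformed name)
            name = name.substring(0, dot);
        }
        return name;
    }

    /** Sums self time per group; a TreeMap keeps the output sorted by name. */
    static Map<String, Long> rollUp(Map<String, Long> selfNanosByMethod, Level level) {
        Map<String, Long> totals = new TreeMap<>();
        selfNanosByMethod.forEach((method, nanos) -> totals.merge(key(method, level), nanos, Long::sum));
        return totals;
    }
}
```

With entries for com.acme.App.call and com.acme.App.leaf, rolling up at Level.CLASS gives a single com.acme.App total, and Level.PACKAGE gives a com.acme total.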
Note: This is not exactly tracing. The collection only holds information on what has transpired in terms of measurement aggregation, not how the measured execution was called or in what order it was performed, because, as all professional performance engineers know, path tracing (especially distributed) does not scale in terms of overhead at runtime or analysis during offline viewing.
Video: Here is a screen recording showing behavioral introspection in action.
Software with embedded self observation can get a much better sense of the underlying execution behavior of classes and components up and down the layers of the stack without ever knowing of them beforehand. A thread can use this self diagnosis during in-flight processing of a request at particular checkpoints in its execution, then reason on the behavior and measurements collected and take possible corrective action at that moment or in the immediate future, such as holding back further processing of requests for a short time whilst an underlying resource is experiencing performance or reliability issues.
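One way to picture such in-flight self regulation is the sketch below. It is only an illustration under stated assumptions: ResourceGuard is a hypothetical class, the meters are owned by a single thread, and the back-off rule (10 ms per multiple of the latency limit, capped) is an arbitrary choice for the example rather than a recommendation.

```java
/** Hypothetical in-flight guard: compares resource latency between checkpoints and backs off. */
final class ResourceGuard {
    private long calls, totalNanos;      // running meters for one resource, owned by one thread
    private long markCalls, markNanos;   // meter values at the previous checkpoint

    /** Called each time a call to the underlying resource completes. */
    void record(long nanos) {
        calls++;
        totalNanos += nanos;
    }

    /** At a checkpoint: mean latency since the previous checkpoint, or 0 if there were no calls. */
    long checkpointMeanNanos() {
        long deltaCalls = calls - markCalls;
        long deltaNanos = totalNanos - markNanos;
        markCalls = calls;
        markNanos = totalNanos;
        return deltaCalls == 0 ? 0 : deltaNanos / deltaCalls;
    }

    /** Suggested pause before taking on more work: 0 when healthy, otherwise a capped back-off. */
    static long backoffMillis(long meanNanos, long limitNanos, long maxMillis) {
        if (limitNanos <= 0) throw new IllegalArgumentException("limitNanos must be positive");
        if (meanNanos <= limitNanos) return 0;
        long factor = meanNanos / limitNanos; // how many times over the limit we are
        return Math.min(maxMillis, factor * 10);
    }
}
```

A request thread would call checkpointMeanNanos() at its chosen checkpoints and sleep for backoffMillis(...) before accepting the next request whenever the result is non-zero.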
Building efficient and effective application performance management solutions on top of this approach and creating behavioral signatures from such change sets becomes relatively straightforward. Take a look at how change sets make up transactions in our management console.
Note: We firmly believe that it is this kind of innovation which is sorely missed in the Java runtime today and which could inspire a whole new crop of innovations in the area of self regulation, self correction, resource management and execution optimization once it is included.
Video: Here is a screen recording showing automatic collection of metering changes by the transaction metering extension without the need to explicitly use the Probes Open API.