Metric Monitoring – Tracking Change Over Time But Not Changing With Time
Metrics have long served as the view model for visualizations in IT monitoring dashboards, but are they really suitable as the foundation for a data collection model that powers analytics and control in the performance monitoring and management of applications and services? Just recently, at a customer site, I was given further cause for concern about the suitability of metrics as the basis for a monitoring data model when a metric collected by another application performance monitoring vendor's solution did not match a similar metric derived from our metering model. An investigation found that the other vendor, the one being replaced, had calculated the metric's measure wrongly. That was the good news. The bad news was that the customer wanted us to calculate the metric wrongly as well. Why? Because otherwise all their trend management reports would be thrown off. It was better to be consistently wrong, as long as the change was tracked correctly across time samples and reports.
I have always preferred behavior-based data collection models, such as activity metering, that allow sufficient context and causation to be captured, recorded, replayed and related. That said, metrics can be useful in answering questions of a temporal nature that focus on direction and rate of change.
We managed to get the customer to see it from our perspective and, in the process, to view metrics as merely snapshots of answers to questions that may have been wrongly posed, constructed and answered. Once this view takes hold, other issues and concerns come to mind, including:
- Maturity: Metrics are constructed and hardwired into an application during development, when there is only a very limited amount of operational information and little knowledge of the factors that drive changes in behavior, consumption and cost. It is like a game of Jeopardy: starting with the answer and then guessing what the question might be once the application is finally deployed into production. Developers decide what a useful information model looks like, and operations later define a subset of it as the management model.
- Traceability: Because metrics are snapshots of answers, there is no traceability as to the method and means of calculation. The problem is compounded when metrics are derived from other metrics with little hint of dependency or expected correlation.
- Reproducibility: Metrics are collected at fixed intervals, but not instantly and never at exactly the same time, so we end up with samples built on snapshots. This makes results nearly impossible to reproduce, and hence to test, without some measurement tolerance, and that tolerance can mask underlying calculation errors.
Metrics for the most part break the cause-and-event chain because the essence of software execution, the flow, is lost. The same can be said of logging and event-based monitoring, neither of which has improved or evolved much over the years, if we discount the indexing and searching of logs.
There is an approach that addresses all of the points above: record the underlying behavioral activity and resource consumption that form the basis of a metric's measure, and then have the metric always derive its snapshot value from the recording, or from a model built by playing the recording back. This gives us not only traceability but also reproducibility, especially when the recording can be simulated in controlled time increments without the metric having any knowledge whatsoever of this construct. Even better, we can always recalculate the past snapshot values of a metric if we later determine that it was incorrectly calculated (or derived).
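To make the idea concrete, here is a minimal sketch in Python. It is not our metering model or any vendor's API; the `Activity` record, the `replay` function and both metric definitions are hypothetical names invented for illustration. It shows a metric deriving its snapshots from a recording that is played back in controlled time increments, and past snapshots being recalculated once a faulty definition is corrected.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Activity:
    """Hypothetical metered activity: a named unit of work and its cost."""
    name: str
    start: float   # seconds since the start of the recording
    end: float
    cpu_ms: float


Metric = Callable[[List[Activity]], float]


def replay(recording: List[Activity], metric: Metric,
           step: float, horizon: float) -> List[Tuple[float, float]]:
    """Advance a simulated clock in fixed increments and derive the metric's
    snapshot from the activities that have completed by each tick."""
    snapshots = []
    ticks = int(round(horizon / step))
    for i in range(1, ticks + 1):
        now = i * step
        completed = [a for a in recording if a.end <= now]
        snapshots.append((now, metric(completed)))
    return snapshots


def mean_latency_miscalculated(acts: List[Activity]) -> float:
    # Deliberately wrong for illustration: divides by count + 1.
    return sum(a.end - a.start for a in acts) / (len(acts) + 1) if acts else 0.0


def mean_latency(acts: List[Activity]) -> float:
    # Corrected definition: total wall-clock time over the number of activities.
    return sum(a.end - a.start for a in acts) / len(acts) if acts else 0.0


recording = [
    Activity("checkout", 0.0, 2.0, 10.0),
    Activity("search", 1.0, 2.0, 5.0),
    Activity("checkout", 3.0, 6.0, 20.0),
]

# Snapshots as first reported, then recalculated from the same recording.
before = replay(recording, mean_latency_miscalculated, step=2.0, horizon=6.0)
after = replay(recording, mean_latency, step=2.0, horizon=6.0)
```

Because the clock is simulated, the same recording always yields the same snapshots, so a metric can be tested exactly rather than within a measurement tolerance. The metric functions never see the clock; they only see the activities handed to them.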
As soon as you start viewing a metered activity recording as the primary source of any monitoring data model, metrics become, in effect, queries that define and refine the transient backing model for a view in a management dashboard. This also unlocks many other benefits, such as the ability to narrow a metric's measurement window and scope not only to a time interval but also to a particular event or chain of events, as well as to the flow of activity that caused the metric to change. Metrics become named queries, and it is the playback in the simulation that reconstructs the world, a smaller world if need be, that holds the answer to a question which need not be known in advance.
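A rough sketch of metrics as named queries follows, again in Python and again with invented names (`Activity`, `within`, `named`, `caused_by`, `QUERIES`, `ask`); none of them comes from a real product. Each activity records the one that caused it, so a query can be scoped to a time interval, to a kind of event, or to the flow triggered by a single activity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Activity:
    """Hypothetical metered activity that remembers what caused it."""
    id: int
    parent: Optional[int]   # id of the causing activity, None for a root
    name: str
    start: float
    end: float


Scope = Callable[[Activity], bool]


def within(t0: float, t1: float) -> Scope:
    """Scope to activities that ran entirely inside a time interval."""
    return lambda a: t0 <= a.start and a.end <= t1


def named(name: str) -> Scope:
    """Scope to one kind of event."""
    return lambda a: a.name == name


def caused_by(recording: List[Activity], root_id: int) -> Scope:
    """Scope to a flow: the root activity and everything it transitively caused."""
    by_id = {a.id: a for a in recording}

    def in_flow(a: Activity) -> bool:
        current: Optional[Activity] = a
        while current is not None:
            if current.id == root_id:
                return True
            current = by_id.get(current.parent) if current.parent is not None else None
        return False

    return in_flow


# Named queries: the question is chosen at read time, not at development time.
QUERIES: Dict[str, Callable[[List[Activity]], float]] = {
    "count": lambda acts: float(len(acts)),
    "total_time": lambda acts: sum(a.end - a.start for a in acts),
}


def ask(recording: List[Activity], query: str, *scopes: Scope) -> float:
    """Answer a named query over the part of the recording matching all scopes."""
    selected = [a for a in recording if all(s(a) for s in scopes)]
    return QUERIES[query](selected)


recording = [
    Activity(1, None, "request", 0.0, 10.0),
    Activity(2, 1, "db_call", 1.0, 4.0),
    Activity(3, 1, "db_call", 5.0, 7.0),
    Activity(4, None, "request", 20.0, 25.0),
    Activity(5, 4, "db_call", 21.0, 24.0),
]

# Time spent in database calls caused by the first request only.
db_time_in_first_flow = ask(recording, "total_time",
                            named("db_call"), caused_by(recording, 1))
```

Adding a new question means adding an entry to `QUERIES` or composing scopes differently; the recording itself does not change.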
It is highly likely that metrics will still be persisted to an externalized model and storage system, but mainly as a repository cache of previously computed answers that can always be invalidated and then repopulated.
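The cache idea can be sketched as follows, assuming a hypothetical `MetricStore` class and a recording simplified to `(timestamp, cpu_ms)` samples. Stored values are treated as disposable answers: redefining a metric invalidates them, and the next read repopulates them from the recording.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]   # (timestamp, cpu_ms); a simplified recording entry


class MetricStore:
    """Hypothetical store that caches metric answers derived from a recording."""

    def __init__(self, recording: List[Sample]) -> None:
        self._recording = recording
        self._definitions: Dict[str, Callable[[List[Sample]], float]] = {}
        self._cache: Dict[Tuple[str, float], float] = {}
        self.computations = 0   # counts cache misses, for illustration only

    def define(self, name: str, fn: Callable[[List[Sample]], float]) -> None:
        """Register or replace a metric definition and drop its stale answers."""
        self._definitions[name] = fn
        self.invalidate(name)

    def invalidate(self, name: str) -> None:
        """Remove every cached answer for the named metric."""
        for key in [k for k in self._cache if k[0] == name]:
            del self._cache[key]

    def value(self, name: str, until: float) -> float:
        """Return the metric as of `until`, computing it only on a cache miss."""
        key = (name, until)
        if key not in self._cache:
            visible = [s for s in self._recording if s[0] <= until]
            self._cache[key] = self._definitions[name](visible)
            self.computations += 1
        return self._cache[key]


store = MetricStore([(1.0, 10.0), (2.0, 30.0), (3.0, 20.0)])
store.define("cpu", lambda samples: sum(cpu for _, cpu in samples))
first = store.value("cpu", until=2.0)
again = store.value("cpu", until=2.0)       # served from the cache

# Redefining the metric invalidates the cached answer; the next read recomputes it.
store.define("cpu", lambda samples: max(cpu for _, cpu in samples))
revised = store.value("cpu", until=2.0)
```

The recording remains the source of truth here; the persisted metric values are only an optimization.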