How (not) to design a Metrics API – Part 1: “Millions of Metrics”

This is part one in a series of articles on how best to design an application metrics monitoring library, and in particular its API, so that it is versatile both in how it can be applied across domains and environments and in how it can be implemented by one or more vendors. In the series we will discuss the thought process, principles and patterns guiding the design and development of the JXInsight/OpenCore Metrics API, and compare it with other libraries that don’t exhibit the qualities that emerge from such software engineering discipline.

“Millions of Metrics” is a statement made by Adrian Cockcroft, Cloud Architect @ Netflix, in discussing the monitoring of applications deployed using their own proprietary PaaS solution. Granted it’s somewhat boastful (he was speaking on stage), but to those not well experienced in application performance monitoring it not only sounds impressive, it sounds like a goal worth pursuing, which would be a mistake. The problem with this unqualified statement is that it does not distinguish between an information model and a management model, a measure and a metric, a measurement and a sample. There is no software, model, process or human that can effectively use, scale and benefit from a set of such size, at least not from a management perspective. It would be terribly (cost) ineffective, as the behavior of most applications and systems is driven (determined) and signaled (distinguished) by a very small number of measures.

Note: It’s not certain whether Adrian confused the (measure) instance count with the (metric) type count.

The information model is the (super)set of all measures that can be monitored. The management model is the (sub)set that is sufficient and suitable for monitoring purposes and is therefore sampled and saved. It should always be possible to inspect, on an ad hoc basis, the value of any measure in the information model, but only a few of those measures should form the management model, which must itself remain manageable. A metrics library that does not make this distinction apparent in the design of its interface, interaction and implementation will ultimately fail, forcing engineers and operations to make poor trade-offs at inappropriate points and limiting its usage to very simple use cases.
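To make the two models concrete, here is a minimal sketch in Java. It is not the OpenCore API: ModelRuntime and all of its method names are hypothetical, chosen only to show a runtime in which every measure can be inspected on demand while only an explicitly chosen subset is sampled.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical runtime holding both models (illustration only, not the OpenCore API). */
final class ModelRuntime {
    // Information model: every measure the application has created.
    private final Map<String, AtomicLong> measures = new LinkedHashMap<>();
    // Management model: names of the few measures that are sampled and saved.
    private final Set<String> monitored = new LinkedHashSet<>();

    /** Looks up a measure, creating it if absent. This does not start sampling it. */
    AtomicLong measure(String name) {
        return measures.computeIfAbsent(name, n -> new AtomicLong());
    }

    /** Promotes an existing measure into the management model. */
    void monitor(String name) {
        if (!measures.containsKey(name)) {
            throw new IllegalArgumentException("unknown measure: " + name);
        }
        monitored.add(name);
    }

    /** Ad hoc inspection works for any measure in the information model. */
    long inspect(String name) {
        AtomicLong m = measures.get(name);
        if (m == null) {
            throw new IllegalArgumentException("unknown measure: " + name);
        }
        return m.get();
    }

    /** Periodic sampling touches only the management model. */
    Map<String, Long> sample() {
        Map<String, Long> snapshot = new LinkedHashMap<>();
        for (String name : monitored) {
            snapshot.put(name, measures.get(name).get());
        }
        return snapshot;
    }

    int informationModelSize() { return measures.size(); }
    int managementModelSize() { return monitored.size(); }
}

final class ModelsDemo {
    public static void main(String[] args) {
        ModelRuntime rt = new ModelRuntime();
        rt.measure("requests.total").addAndGet(3);
        rt.measure("cache.hits").incrementAndGet();
        rt.measure("cache.misses").incrementAndGet();

        rt.monitor("requests.total");                  // only this one is sampled

        System.out.println(rt.sample());               // {requests.total=3}
        System.out.println(rt.inspect("cache.hits"));  // 1, still inspectable ad hoc
    }
}
```

The cost of sampling and storage grows with the size of the monitored set, not with the number of measures the application happens to create.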

The JXInsight/OpenCore Metrics Open API makes this distinction by having separate types for Metric and Measure. Whilst all instances of Measure (Counter and Gauge) are managed by the runtime, only those instances that are registered with the runtime for the purpose of monitoring (sampling & collection) are viewed (associated) as a Metric. A Measure becomes a Metric when it is both managed and monitored. A Measure need not be a Metric, yet it can still be accessible to callers for the purpose of update and access. This is a much better option than having static (global) fields scattered throughout the code base holding references to such measures.

A benefit of this design is that more than one Metric can be mapped to a particular Measure, each under a different Name. It also allows a Measure to be composed from other instances of Measure, with only the composite registered (associated) as a Metric. The registration of a Measure as a Metric can be removed from the code itself and instead configured externally. The design does not tie the lifecycles of the two types together, and it does not expose state and functioning that would otherwise be made accessible if both types were combined.
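The separation described above can be sketched with a few hypothetical types. SketchMeasure, SketchCounter, SketchMetric and SketchRegistry are stand-ins invented for this illustration and are not the OpenCore interfaces. The sketch shows one measure registered under two names, and a composite measure registered as a metric while one of its parts never is.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Anything with a current value: the unit of the information model (hypothetical). */
interface SketchMeasure {
    long getValue();
}

/** A mutable measure that callers update (hypothetical). */
final class SketchCounter implements SketchMeasure {
    private final AtomicLong count = new AtomicLong();
    void inc() { count.incrementAndGet(); }
    @Override public long getValue() { return count.get(); }
}

/** A read-only pairing of a name with a measure: the unit of the management model. */
final class SketchMetric {
    private final String name;
    private final SketchMeasure measure;

    SketchMetric(String name, SketchMeasure measure) {
        this.name = name;
        this.measure = measure;
    }

    String getName() { return name; }
    long getValue() { return measure.getValue(); }  // no mutators are exposed here
}

/** Holds only the measures that were explicitly registered for sampling. */
final class SketchRegistry {
    private final Map<String, SketchMetric> metrics = new LinkedHashMap<>();

    SketchMetric register(String name, SketchMeasure measure) {
        SketchMetric metric = new SketchMetric(name, measure);
        if (metrics.putIfAbsent(name, metric) != null) {
            throw new IllegalStateException("metric already registered: " + name);
        }
        return metric;
    }

    Map<String, Long> sample() {
        Map<String, Long> snapshot = new LinkedHashMap<>();
        metrics.forEach((name, metric) -> snapshot.put(name, metric.getValue()));
        return snapshot;
    }
}

final class MeasureMetricDemo {
    public static void main(String[] args) {
        SketchCounter hits = new SketchCounter();
        SketchCounter misses = new SketchCounter();
        hits.inc();
        hits.inc();
        misses.inc();

        SketchRegistry registry = new SketchRegistry();
        // One measure mapped to two metrics under different names.
        registry.register("cache.hits", hits);
        registry.register("legacy.cacheHits", hits);
        // A composite measure; "misses" itself is never registered as a metric.
        SketchMeasure lookups = () -> hits.getValue() + misses.getValue();
        registry.register("cache.lookups", lookups);

        System.out.println(registry.sample());
        // {cache.hits=2, legacy.cacheHits=2, cache.lookups=3}
    }
}
```

Because SketchMetric wraps the measure rather than extending it, code that reads a metric cannot increment the underlying counter, and a measure can outlive or predate any metric that refers to it.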

Here is a snippet of code showing how the JXInsight/OpenCore Metrics Open API is used to look up (and create if not present) an instance of Counter, increment it and then register it with the runtime for sampling and collection purposes using its own Name.

Note: The registration of a Counter as a Metric can be done automatically using the jxinsight.server.metrics.counters=${c1},${c2} system property.

The following snippet shows the registration of a Counter as a Metric with an alternative Name mapping.

The interaction story line for a Gauge is nearly identical.

Note: The registration of a Gauge as a Metric can be done automatically using the jxinsight.server.metrics.gauges=${g1},${g2} system property.

Gauges, like Counters, can also be registered as a Metric under different Name mappings.

Note: Use of Counter and Gauge is optional, as any implementation of Measure can be registered as a Metric, providing extra immutability safeguards.

Due to a deliberate, stylistic design choice to represent elements of the model largely as inner interfaces, we can do a static import on the Metrics class and increase the readability of the code.

Note: Except for some struct & enum classes and the Metrics entry point class, we only make (inner) interfaces public. This allows different implementations of the runtime to offer alternatives for Counter and Gauge.
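A rough sketch of this style follows, using a hypothetical SketchMetrics entry point rather than the real Metrics class. Callers only ever name the entry point and its nested interfaces, so the concrete Counter implementation can be swapped without touching calling code. The install and adderCounters methods are assumptions made for the illustration, not part of any published API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical entry point: the only concrete type callers ever name. */
final class SketchMetrics {
    private SketchMetrics() {}

    // Model elements are nested interfaces. In a packaged code base, a line such as
    // "import static com.example.SketchMetrics.*;" (hypothetical package) would let
    // callers write Counter and counter(...) without the SketchMetrics qualifier.
    interface Measure { long getValue(); }
    interface Counter extends Measure { void inc(); }

    private static final Map<String, Counter> COUNTERS = new ConcurrentHashMap<>();
    private static volatile Supplier<Counter> factory = AtomicCounter::new;

    /** Looks up a counter by name, creating it with the installed implementation. */
    static Counter counter(String name) {
        return COUNTERS.computeIfAbsent(name, n -> factory.get());
    }

    /** Lets an alternative runtime supply its own Counter implementation. */
    static void install(Supplier<Counter> alternative) {
        factory = alternative;
    }

    /** An alternative implementation better suited to heavy write contention. */
    static Supplier<Counter> adderCounters() {
        return AdderCounter::new;
    }

    // Implementations stay private; callers cannot depend on them.
    private static final class AtomicCounter implements Counter {
        private final AtomicLong value = new AtomicLong();
        @Override public void inc() { value.incrementAndGet(); }
        @Override public long getValue() { return value.get(); }
    }

    private static final class AdderCounter implements Counter {
        private final LongAdder value = new LongAdder();
        @Override public void inc() { value.increment(); }
        @Override public long getValue() { return value.sum(); }
    }
}

final class InnerInterfacesDemo {
    public static void main(String[] args) {
        SketchMetrics.Counter requests = SketchMetrics.counter("requests");
        requests.inc();

        // Swap the implementation; the calling code below is written the same way.
        SketchMetrics.install(SketchMetrics.adderCounters());
        SketchMetrics.Counter errors = SketchMetrics.counter("errors");
        errors.inc();
        errors.inc();

        System.out.println(requests.getValue() + " " + errors.getValue()); // 1 2
    }
}
```

Had Counter been a public class, every caller would be compiled against one implementation and replacing it would be a breaking change.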

Note: Since the release of our Metrics Open API over 3 years ago we have not broken backwards compatibility and have never deprecated any interfaces or classes or methods. It took us a year to design the Open API prior to its public release. That time was well spent in getting the model, concepts, names and signatures optimal, aligned and consistent.

A Note on Alternatives

Before comparing the OpenCore Metrics Open API with two open source alternatives found on GitHub, it’s important to note that our Metrics Open API was made public on the 15th May 2009, the same day that Amazon AWS announced CloudWatch and its own Metrics API. It remained publicly accessible up until Oct 2011, when we moved all our developer content over to our http://developer.jinspired.com site.

Since we don’t view the alternatives used throughout this series as examples of good design, we have changed their package namespaces whilst keeping them phonetically similar.

Yammer

This popular metrics library, first published in Feb 2010 (though it was very different then), is similar to OpenCore in using a main entry point class named Metrics within an enclosing package named metrics. Originally this class was named com.jammer.metrics.core.MetricsFactory, but it was renamed in Dec 2010. Evidence of this factory heritage can be seen in its use of “new” as a prefix in methods which look up and create a Counter.

There is no distinction between an information model and a management model. The Counter is a class (a poor design decision) that implements the Metric interface and which has one very peculiar method named processWith (which will be revisited in a future part).

In the creation of the Counter this intersection (joining) of the information model and the management model is done automatically, without recourse. Of course one could always instantiate the Counter without using the Metrics class, but then it still needs to be added to some shared map structure to make it accessible to ad hoc inspection tools. This brings us to the next issue with this API: the exposure of an underlying registry data structure, MetricsRegistry, which can be used to determine whether a Metric is available without actually resulting in its creation. Incidentally, the interface also requires the caller to know the class type just to be able to determine the current value. There are far too many aspects of the underlying implementation exposed; even MetricsRegistry is a class and not an interface. It’s a very bad code smell in a class, Metrics, that is only a few lines long.

Note: Version 2 of this library released in 2012 was a complete rewrite breaking any sort of backwards compatibility. Expect the same to occur repeatedly for version 3, version 4 and onwards.

For measures of type Gauge the same issues are present, along with others, including not offering an actual implementation of this abstract class in the library, usage of numeric (un)boxing and an alternatively named value accessor.
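The boxing concern can be illustrated with two hypothetical gauge shapes. BoxedGauge and LongGauge are invented for this sketch and are not taken from either library. A generically typed abstract class hands numeric values back as objects, whereas an interface with a primitive accessor does not.

```java
/** Hypothetical generic gauge: a numeric value is returned as an object. */
abstract class BoxedGauge<T> {
    abstract T value();
}

/** Hypothetical primitive gauge: the value is returned as a long. */
interface LongGauge {
    long getValue();
}

final class GaugeBoxingDemo {
    public static void main(String[] args) {
        long[] queueDepth = {42};

        // The caller must supply its own subclass, and each read boxes the long.
        BoxedGauge<Long> boxed = new BoxedGauge<Long>() {
            @Override Long value() { return queueDepth[0]; }  // long -> Long
        };

        // A single-method interface can be a lambda, with no boxing on the read path.
        LongGauge primitive = () -> queueDepth[0];

        long viaBoxed = boxed.value();            // Long -> long (unboxing)
        long viaPrimitive = primitive.getValue();

        System.out.println(viaBoxed + " " + viaPrimitive); // 42 42
    }
}
```

The differing accessor names (value versus getValue) also mean that generic tooling cannot read a counter and a gauge through one common method unless both share an interface that declares it.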

Netflix

The main entry point into this metrics library is DefaultMonitorRegistry which allows for instances of Monitor (a metric in our management model) to be registered via cumbersome call sequences such as the creation of a MonitorContext (which is effectively a metric name) using the fluent builder call pattern. There is a Metric class but that is in fact named incorrectly as it actually represents a timestamped record of a sampled monitor value. Pretty much every primary class in this library is poorly named (and the underlying implementation is not any better).

It makes the same fundamental mistake that all newcomers to metric monitoring make in not distinguishing the management model from the information model. Amazingly, there is no way to look up a Monitor (measure) and its value after it has been added to the MonitorRegistry (model) without going through another management library, JMX, to which the underlying implementation is naively (and horrendously) coupled. Leaving aside the JMX access path, there is in fact no information model, which is wrong, as both models are needed, though not always at the same time and under the same (incident mgmt) circumstances.

That said, it does at least define a getValue method in the Monitor interface, which is extended by both the Gauge and Counter interfaces. This is largely a reflection of the library being simply an embellishment on JMX, which has many other problems, in particular in its name identifiers, which this library has also inherited via what it calls “tags” in the MonitorContext.

Here is how a Counter is published and registered (and made inaccessible).

The Gauge has a similar call sequence though not without its own issues in numeric boxing.

Note: To make all metrics registered with the OpenCore Metrics Open API accessible via JMX, all that is needed is a single system property to enable this extension: jxinsight.server.metrics.management.enabled=true.

Part 2: Delegation & Separation