
How (not) to design a Metrics API – Part 5: Alternatives to Reset

This is part five in a series of articles on how best to design an application metrics monitoring library, and in particular its API, so that it is versatile both in its application across domains and environments and in its implementation by one or more vendors. Please read part 1, part 2, part 3 and part 4 if you have not already done so.

Probably one of the worst design choices a metrics library developer can make is to support a reset-like operation. While it is easy to understand why a developer (with limited operational experience) might request such a capability, it should be avoided. Not only does it add to the development effort of the measures that form the basis of metrics (i.e. concurrent state management), it invariably complicates the development of plugins and console clients, especially with regard to cumulative measurements, which are no longer guaranteed to be cumulative in value. It also raises security concerns.
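To make the cumulative-value problem concrete, here is a minimal Java sketch. It assumes a hypothetical ResettableCounter and a hypothetical console client, DeltaClient, that derives the per-interval change by subtracting successive samples. None of these names come from JXInsight/OpenCore or any other library; they only illustrate how a single reset breaks every client that assumes a counter never decreases.

```java
import java.util.concurrent.atomic.AtomicLong;

// Demo class declared first so the single-file launcher (java ResetDemo.java) runs it.
class ResetDemo {
    public static void main(String[] args) {
        ResettableCounter counter = new ResettableCounter();
        DeltaClient client = new DeltaClient();

        counter.add(100);
        System.out.println(client.sample(counter)); // 100
        counter.add(50);
        System.out.println(client.sample(counter)); // 50
        counter.reset();                             // another client wipes the slate clean
        counter.add(20);
        System.out.println(client.sample(counter)); // -130, although only 20 was added
    }
}

// Hypothetical counter that (unwisely) exposes reset(); not a real library class.
final class ResettableCounter {
    private final AtomicLong value = new AtomicLong();

    void add(long delta) { value.addAndGet(delta); }

    long get() { return value.get(); }

    // Zeroes the counter: any client holding an earlier sample
    // now sees the "cumulative" value go backwards.
    void reset() { value.set(0); }
}

// Hypothetical console client: the delta between successive samples
// is only meaningful if the counter is truly cumulative.
final class DeltaClient {
    private long previous;

    long sample(ResettableCounter counter) {
        long current = counter.get();
        long delta = current - previous;
        previous = current;
        return delta;
    }
}
```

Once reset() exists, every plugin or console client of this kind has to detect and special-case a negative delta, and it still cannot tell how much was actually added across the reset.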

Measurement data is collected at a cost and holds value, though admittedly that value depreciates once the data is removed from its space and time context. In our opinion it should never be possible to wipe the slate clean for even a single metric, at least not for base metric measures.

Reset, as in clearing or zeroing values, is in our view a misinterpretation of the actual client requirement. What is really being asked for is the ability to isolate the change in a metric's value relative to one or more earlier points in time. This is best achieved with marking, tagging or explicit change tracking via the SavePoint and ChangeSet interfaces.

Here is some sample Metrics Open API code, which will be used to demonstrate both the marking and tagging capabilities available within JXInsight/OpenCore.

Here is an example of the lines printed.

uptime=2,005,000
uptime=3,018,000
uptime=4,019,000
uptime=5,020,000
uptime=6,022,000
uptime=7,023,000
uptime=8,023,000
uptime=9,025,000
uptime=10,025,000

I used the following jxinsight.override.config contents to enable the metrics event extension, which the listener relies on, and to set the collection interval to 1 second (1000 ms).

j.s.m.collection.interval=1000
j.s.m.event.enabled=true

Marking
The Metrics Open API supports the marking of metrics via a single method, Metrics.mark(). Like many optional metric extensions, this capability is not enabled by default and must be switched on in the jxinsight.override.config file:

j.s.m.mark.enabled=true

When enabled, a mark metric, ${metric}.mark, of type GAUGE is automatically created (under the hood) for each metric of type COUNTER registered by clients or by other metric extensions. Its value is the change in the counter since the most recent mark.

Below is a snippet of code used to mark the metric collection every 3 seconds.

Here is an example of the lines printed out by the listener above.

uptime=2,010,000
uptime=3,023,000
uptime=4,024,000 uptime.mark=1,001,000
uptime=5,025,000 uptime.mark=2,002,000
uptime=6,026,000 uptime.mark=0
uptime=7,027,000 uptime.mark=1,001,000
uptime=8,028,000 uptime.mark=2,002,000
uptime=9,030,000 uptime.mark=0
uptime=10,031,000 uptime.mark=1,001,000
uptime=11,032,000 uptime.mark=2,002,000

As the output shows, the uptime metric (being a counter) is never affected by the mark operation, whereas the associated uptime.mark gauge drops back to 0 on every third collection, when the mark is taken.
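The mark semantics visible in the output can be modelled in a few lines. The following is a minimal Java sketch, assuming a hypothetical MarkedCounter class (not part of the Metrics Open API) in which the cumulative counter is never touched and the mark gauge is simply the difference between the current counter value and the value recorded at the most recent mark.

```java
import java.util.concurrent.atomic.AtomicLong;

// Demo class declared first so the single-file launcher runs it.
class MarkDemo {
    public static void main(String[] args) {
        MarkedCounter uptime = new MarkedCounter();
        uptime.add(2_000);
        uptime.mark();                 // baseline recorded at 2,000
        uptime.add(1_000);
        System.out.println(uptime.value() + " " + uptime.markValue()); // 3000 1000
        uptime.mark();                 // gauge back to 0, counter untouched
        System.out.println(uptime.value() + " " + uptime.markValue()); // 3000 0
    }
}

// Hypothetical model of a COUNTER with its ${metric}.mark GAUGE;
// illustrative only, not a JXInsight/OpenCore class.
final class MarkedCounter {
    private final AtomicLong total = new AtomicLong();    // cumulative, never reset
    private final AtomicLong markedAt = new AtomicLong(); // counter value at the last mark (0 before any mark)

    void add(long delta) { total.addAndGet(delta); }

    long value() { return total.get(); }

    // Records the current counter value as the new baseline.
    void mark() { markedAt.set(total.get()); }

    // Gauge value: change in the counter since the most recent mark.
    // The two reads are not an atomic snapshot, which is acceptable for a sketch.
    long markValue() { return total.get() - markedAt.get(); }
}
```

Because mark() only moves a baseline, any number of clients can keep reading the cumulative value safely, which is precisely what a reset would have broken.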

Tagging
Tagging is very similar to marking, but instead of a single additional metric being generated for each registered metric, there can be multiple metrics, each distinguished by a different tag name. Tagging is done via the Metrics.tag(String) Open API method.

Just like marking, tagging needs to be enabled in the jxinsight.override.config file.

j.s.m.tag.enabled=true

Below is a snippet of code used to alternate between two different tags every 3 seconds.

Here is an example of the lines printed out by the listener above.

uptime=2,006,000
uptime=3,019,000
uptime=4,021,000 uptime.tag.@peak=0
uptime=5,023,000 uptime.tag.@peak=1,002,000
uptime=6,024,000 uptime.tag.@peak=1,002,000 uptime.tag.@offpeak=0
uptime=7,026,000 uptime.tag.@peak=1,002,000 uptime.tag.@offpeak=1,002,000
uptime=8,027,000 uptime.tag.@peak=1,002,000 uptime.tag.@offpeak=2,003,000
uptime=9,028,000 uptime.tag.@peak=1,002,000 uptime.tag.@offpeak=2,003,000
uptime=10,029,000 uptime.tag.@peak=2,003,000 uptime.tag.@offpeak=2,003,000
uptime=11,030,000 uptime.tag.@peak=3,004,000 uptime.tag.@offpeak=2,003,000

Unlike marking, tagging automatically creates and registers metrics named ${metric}.tag.${tag} of type COUNTER, so switching the current tag back to a previously used tag does not reset the associated measure values (in the output above, uptime.tag.@peak resumes from 1,002,000 rather than starting again at 0). And because tagged metrics are cumulative, marking can be combined with tagging, resulting in GAUGE metrics named ${metric}.tag.${tag}.mark and ${metric}.mark.tag.${tag}.
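A similar sketch for tagging, again using a hypothetical TaggedCounter class rather than the real JXInsight/OpenCore implementation: every increment is added to the base counter and, if a tag is current, to that tag's own cumulative counter, so switching back to an earlier tag resumes its count instead of restarting it. It models increments directly rather than collection samples, so it does not attempt to reproduce the exact values printed above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Demo class declared first so the single-file launcher runs it.
class TagDemo {
    public static void main(String[] args) {
        TaggedCounter uptime = new TaggedCounter("uptime");
        uptime.add(1_000);          // no tag yet: only the base counter moves
        uptime.tag("@peak");
        uptime.add(1_000);
        uptime.tag("@offpeak");
        uptime.add(2_000);
        uptime.tag("@peak");        // back to an earlier tag: its count resumes
        uptime.add(500);
        System.out.println(uptime.snapshot());
        // {uptime=4500, uptime.tag.@offpeak=2000, uptime.tag.@peak=1500}
    }
}

// Hypothetical model of tagging; not the JXInsight/OpenCore implementation.
final class TaggedCounter {
    private final String name;
    private final AtomicLong total = new AtomicLong();
    private final Map<String, AtomicLong> tagged = new ConcurrentHashMap<>();
    private volatile String currentTag; // null until tag(...) is first called

    TaggedCounter(String name) { this.name = name; }

    // Switching tags never zeroes anything; it only changes which
    // tagged counter subsequent increments are also attributed to.
    void tag(String tag) { currentTag = tag; }

    void add(long delta) {
        total.addAndGet(delta);
        String tag = currentTag;
        if (tag != null) {
            tagged.computeIfAbsent(tag, t -> new AtomicLong()).addAndGet(delta);
        }
    }

    // Sorted view of ${metric} and every ${metric}.tag.${tag} counter.
    Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        out.put(name, total.get());
        tagged.forEach((t, v) -> out.put(name + ".tag." + t, v.get()));
        return out;
    }
}
```

Since each tagged counter is itself cumulative, the mark sketch from the previous section could be layered on top of it unchanged, mirroring the combined tag and mark metrics described above.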

SavePointing
Finally, the JXInsight/OpenCore Metrics Open API offers the ability to create arbitrary snapshots (save points) of the metric collection and compare them with the collection at a later point in the process execution, without creating any additional metrics or enabling any optional metric extensions.

Here is an example of the lines printed by the above code.

uptime=3,019,000
uptime=3,004,000
uptime=3,003,000
uptime=3,003,000
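Save points can be sketched without any extra metrics at all. The Java below assumes a hypothetical SavePoint class holding an immutable copy of metric values and a changeSet method returning per-metric differences against the live values; a plain Map stands in for the metric collection. It is illustrative only and does not reproduce the SavePoint and ChangeSet interfaces of the Metrics Open API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Demo class declared first so the single-file launcher runs it.
class SavePointDemo {
    public static void main(String[] args) {
        Map<String, Long> live = new HashMap<>(); // stand-in for the metric collection
        live.put("uptime", 2_000L);

        SavePoint sp = SavePoint.of(live);
        live.put("uptime", 5_019L);
        live.put("requests", 42L);               // registered after the save point

        System.out.println(sp.changeSet(live));  // {requests=42, uptime=3019}
    }
}

// Hypothetical SavePoint/ChangeSet sketch: an immutable copy of metric values
// that can later be diffed against live values. Not the JXInsight/OpenCore API.
final class SavePoint {
    private final Map<String, Long> values;

    private SavePoint(Map<String, Long> values) { this.values = values; }

    // Copies the current values so later updates cannot alter the save point.
    static SavePoint of(Map<String, Long> current) {
        return new SavePoint(Map.copyOf(current));
    }

    // ChangeSet: per-metric difference between now and the save point.
    // Metrics absent from the save point are treated as starting at 0.
    Map<String, Long> changeSet(Map<String, Long> current) {
        Map<String, Long> changes = new TreeMap<>();
        current.forEach((name, now) ->
            changes.put(name, now - values.getOrDefault(name, 0L)));
        return changes;
    }
}
```

Any number of independent save points can coexist, each answering "what changed since then?" for its own client, while the underlying metrics stay strictly cumulative.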

Alternatives
In previous articles in this series we compared our design approach with the engineering of similar open source metric libraries at Netflix and Yammer (now Microsoft). Unfortunately, neither alternative library offers anything in the way of reset functionality (or tagging or marking, for that matter), though an issue has already been logged against one of them under the title "JMX reset operation on metrics".

Judging by recent web traffic to this article series from people at each company, and by subsequent commits on GitHub referencing criticism of earlier poor design choices, expect to see reset being passed over. To be honest, that is better than breaking compatibility with all previous versions of a library, which such commits have a habit of doing (including the ones listed below).