Rewriting Application Performance Monitoring Histories with Record & Playback
Many application performance monitoring solutions let you filter metrics (measurement samples collected at fixed time intervals) by time slice, source and metric name in live custom dashboards. But because metrics are observations rather than recordings of actual execution behavior, it is not possible to recreate a life-like viewing experience of activity events unfolding, filtered to a particular type of activity (transaction, task, action, job, etc.). Nor is it possible to recalculate metrics so that they reflect a specific context of interest, because a metric can be (and generally is) updated in multiple contexts. How do you ask how many JDBC/SQL operations per second occurred as a result of business activity X when one metric represents business activity X and another the database activity? You simply can’t answer this with metrics, as the cause-effect chain is missing. This is the result of the “world is flat” metric model held by nearly all traditional application performance monitoring vendors.
One way around this is to partition the deployment so that only one type of activity is deployed to each application runtime. Operationally, this is not practical, especially when a system consists of thousands of activities, which is typical of enterprise applications. It would also be extraordinary to take such a deployment approach solely for the sake of monitoring, ignoring the operational overhead it presents. There are many valid reasons for partitioning and modularizing applications into service types and service runtime instances. Monitoring is not one.
An approach that can meet this important challenge involves (1) recording the software’s behavior, (2) fast playback of the recorded behavior with contextual filtering, which produces a re-recording, and finally (3) time-synched (normal speed) playback of the re-recording for experiential learning purposes. You can think of it along the same lines as video editing, but instead of removing scenes and frames we eliminate playback events using the very same intelligent activity metering filter techniques available at the time of actual recording in production. Below is a graphic that visualizes the process. In the playback of the original recording we have installed a metering extension that eliminates all but the orange circles, here representing our screen actors or software activities.
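The three steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the metering engine's actual API: the `Event` shape, the `playback` function and the probe names (`orange.render`, `blue.index`) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Event:
    """One metering event in a recording (hypothetical shape)."""
    thread: str  # name of the thread that fired the event
    kind: str    # "begin" or "end" of a metered activity
    probe: str   # name of the metered activity


def playback(recording: Iterable[Event],
             keep: Callable[[Event], bool]) -> Iterator[Event]:
    """Replay a recording, passing on only the events the filter keeps."""
    for event in recording:
        if keep(event):
            yield event


# Step 1: the original recording (in practice this would be read from a file).
original: List[Event] = [
    Event("t1", "begin", "orange.render"),
    Event("t2", "begin", "blue.index"),
    Event("t2", "end", "blue.index"),
    Event("t1", "end", "orange.render"),
]

# Step 2: fast playback with a filter; collecting the output is the re-recording.
refined: List[Event] = list(
    playback(original, lambda e: e.probe.startswith("orange."))
)

# Step 3 would replay `refined` at normal speed; here we simply print it.
for event in refined:
    print(event.thread, event.kind, event.probe)
```

Because the source recording is never modified, the same `original` can be played back again with a different filter to produce another cut.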
What is really nice about this approach is that you can easily serve up different refined recordings from a single source recording to multiple consumers, each with their own perspective and filtered view of the application behavior. Even if your application and its composite services are deployed as one big monolithic runtime, you can still slice and dice the monitoring data and get a life-like replay that makes it look as if each service were deployed separately (isolated). This addresses the different reporting needs of Operations and Development. Operations can have a single pane of observation across all services, and development teams can get a service- or component-specific monitoring view. All parties can repeatedly relive and refine the events that are of interest to them as if they were happening in the present and in front of their eyes. You can have different “director’s cuts” of the same film.
To demonstrate this I have created a screen recording that takes a metering recording (200+ million events) from an Apache Cassandra 2.0 server and plays it back with different configurations to generate alternative snapshots (metering reporting) and recordings (metering events). Here is a screenshot of the snapshot generated from a playback without any contextual filtering.
The following snapshot screenshot was taken following a playback of a recording with the entrypoint metering extension enabled and configured to restrict metering to nested callee (direct and indirect) probes of the org.apache.cassandra.db.RowMutation.apply method. Note how the RowMutation.apply method is the top probe in this revision. We have basically relived the past and limited our vision of it to those metering events that occurred within the thread scope of the apply method execution.
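To make the idea of thread scope concrete, here is a minimal sketch of such a filter. It is not the entrypoint extension's implementation; the event shape, the function name `entrypoint_filter` and every probe name other than org.apache.cassandra.db.RowMutation.apply are hypothetical. The filter tracks, per thread, whether an entry point call is currently open and keeps only the events fired while it is.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One metering event (hypothetical shape)."""
    thread: str  # thread that fired the event
    kind: str    # "begin" or "end"
    probe: str   # metered method name


def entrypoint_filter(recording: Iterable[Event], entrypoint: str) -> List[Event]:
    """Keep the entry point and everything nested inside it on the same thread."""
    open_calls: Dict[str, int] = defaultdict(int)  # open entry point calls per thread
    kept: List[Event] = []
    for e in recording:
        is_entry = e.probe == entrypoint
        if is_entry and e.kind == "begin":
            open_calls[e.thread] += 1
        if open_calls[e.thread] > 0:
            kept.append(e)  # inside the entry point's thread scope
        if is_entry and e.kind == "end" and open_calls[e.thread] > 0:
            open_calls[e.thread] -= 1
    return kept


APPLY = "org.apache.cassandra.db.RowMutation.apply"

# Illustrative recording; the "example.*" probes are made-up names.
recording = [
    Event("t1", "begin", "example.Caller.handle"),
    Event("t1", "begin", APPLY),
    Event("t2", "begin", "example.Background.run"),
    Event("t1", "begin", "example.Nested.write"),
    Event("t1", "end", "example.Nested.write"),
    Event("t2", "end", "example.Background.run"),
    Event("t1", "end", APPLY),
    Event("t1", "end", "example.Caller.handle"),
]

filtered = entrypoint_filter(recording, APPLY)
for e in filtered:
    print(e.thread, e.kind, e.probe)
```

In this sketch the caller of the entry point and the unrelated activity on thread t2 are dropped, so any counts recomputed from `filtered` reflect only work done as a consequence of the apply call. This is the cause-effect link that separate flat metrics cannot provide.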
We can go even further by enabling the timesync metering extension during playback of a recording and experience the “live” monitoring of an application, including the spread of activity across threads and probes. Here is a screenshot of the console whilst the unfiltered playback was in progress. Please take note of the fourth row section (all green circles) in the metering table page, showing threads that have performed metered activity in the last second interval.
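One way to picture time-synched playback is as a replay loop that waits until each event's recorded offset from the start has elapsed before emitting it. The sketch below assumes a recording of (timestamp in seconds, payload) pairs; the function name `timesync_playback` and its parameters are hypothetical and are not taken from the timesync extension itself.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def timesync_playback(recording: Iterable[Tuple[float, Any]],
                      emit: Callable[[Any], None],
                      speed: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> None:
    """Emit each payload at the same relative time at which it was recorded."""
    first_ts = None
    start_wall = 0.0
    for ts, payload in recording:
        if first_ts is None:
            # Anchor recorded time to wall-clock time at the first event.
            first_ts = ts
            start_wall = clock()
        target = (ts - first_ts) / speed          # when this event is due
        delay = target - (clock() - start_wall)   # how far ahead of schedule we are
        if delay > 0:
            sleep(delay)
        emit(payload)


# Tiny demo: three events recorded 10 ms apart are replayed 10 ms apart.
seen = []
timesync_playback([(0.00, "a"), (0.01, "b"), (0.02, "c")], seen.append)
print(seen)
```

Pacing against the elapsed time since the start, rather than sleeping for each gap between consecutive events, keeps small scheduling delays from accumulating over a long playback. The `sleep` and `clock` parameters exist only so the pacing can be exercised without real waiting.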
The following screenshot was taken during the playback of a recording that was created from a playback of the original recording with the entrypoint metering extension enabled. Whilst row sections 5 through to 8 bear a strong resemblance to the above, as they represent the live activity of the RowMutation.apply probe, there is a noticeable change elsewhere in the visualizations, in particular those related to global (process level) metering measurements (row sections 1 through to 4).
Below is a graphic that aligns and calls out the visual differences in the above screenshots. The most striking difference is in the live “now” threads row section. The revised recording playback has significantly fewer threads because these are the threads that executed the
The screen video recording is also available on YouTube and ideally should be viewed in full screen HD mode.
Note: During playback of the original recording more than 2 GB of data (holding 200 million events) is read, resulting in the creation of hundreds of threads, with each thread executing millions of call frame pushes and pops on and off a call stack.