JXInsight/Simz – Time, Space & Data in Application Performance Monitoring
When considering the runtime diagnostics and performance analysis provided by an application performance monitoring solution, particular attention should be given to time, space and data. Time is the delay from the moment an event occurs until it is classified and analyzed. Space is the distribution of monitoring functionality, which can be centralized or distributed, partitioned or replicated. Data is the collection, modeling and sharing of measurement-based observations.
Note: Time delay typically arises from data transmission latency across processes and storage devices, from the scheduled execution of monitoring tasks, and from waiting for measured transactions and traces to complete.
Note: Data is a large topic that won’t be covered in depth here. It is discussed briefly in relation to space and time, with which it is intertwined, and with regard to the implementation and operation of monitoring functionality.
A typical performance profiler co-locates the diagnosis and (data) analysis in the same runtime. This architecture introduces very little delay as long as the observations are consumed within the application itself. Delay is only introduced, along with distribution, if the monitoring data is pulled or pushed to a management console. For most monitoring consoles a small subset of the data is “continuously” polled (every few seconds) with the option to perform a full data snapshot on demand.
Whilst this architecture is ideal for applications developed (or deployed) with self-regulation and/or self-adaptation, as should be the case for most cloud computing applications, it is not concerned with global performance issues. That said, local application/component diagnosis is a very good starting point in any investigation, especially if used in conjunction with adaptive controlled execution and signal relaying.
First Generation APM
Most first generation application performance monitoring solutions focus on the global, high-level picture of application health. Application runtime performance measurements are pulled by (or pushed to) a central monitoring server and stored as records in a database, which are later queried and retrieved by sub-component routines within the server responsible for diagnosis and analysis.
This type of architecture introduces delay at the data transmission points, in data storage (and retrieval), and in the scheduled execution of monitoring routines. Additional delay is introduced by waiting for the completion of a transaction, trace or metric collection, as well as by the batching of such things within the application runtime before transmission.
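To make the accumulation of delay concrete, the sketch below adds up a worst-case figure for each stage named above. The `DelayBudget` class and every duration in it are hypothetical values chosen for illustration only; they are not measurements of any product.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Illustrative only: stage names follow the text, durations are assumed values.
final class DelayBudget {

    // One source of delay between an event occurring and its analysis.
    static final class Stage {
        final String name;
        final Duration worstCase;

        Stage(String name, Duration worstCase) {
            this.name = name;
            this.worstCase = worstCase;
        }
    }

    // In a store-then-query pipeline the stages run one after another,
    // so their worst-case delays add up.
    static Duration worstCaseDelay(List<Stage> stages) {
        Duration total = Duration.ZERO;
        for (Stage stage : stages) {
            total = total.plus(stage.worstCase);
        }
        return total;
    }

    // Assumed figures for a first generation deployment.
    static List<Stage> firstGenerationStages() {
        return Arrays.asList(
            new Stage("wait for transaction completion", Duration.ofSeconds(2)),
            new Stage("batching in the application runtime", Duration.ofSeconds(10)),
            new Stage("transmission to the server", Duration.ofMillis(50)),
            new Stage("database write and read", Duration.ofMillis(200)),
            new Stage("scheduled analysis interval", Duration.ofSeconds(60)));
    }

    public static void main(String[] args) {
        for (Stage stage : firstGenerationStages()) {
            System.out.println(stage.name + ": " + stage.worstCase.toMillis() + " ms");
        }
        System.out.println("total: "
            + worstCaseDelay(firstGenerationStages()).toMillis() + " ms");
    }
}
```

With these assumed figures the scheduled analysis interval dominates the total, so shortening transmission alone would change little.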
There is no distribution of monitoring functionality beyond the data collection itself performed within the application runtime.
The scope of the monitoring functionality can be both local and distributed, though in practice it tends to be very high level with low resolution. Most of the richness of the software execution behavioural model is lost in translation to record form, and there are significant difficulties in reconstructing a suitable model for comprehensive execution analysis.
“Next” Generation of APM
To address issues with older generations of application performance monitoring solutions, such as limited scalability and poor quality of information, newer solutions limit the data distribution by co-locating much of the diagnosis function within the application runtime itself.
Unfortunately, this approach does not fully address delay. Results still need to be transmitted to the monitoring server in the form of filtered (“on exception”) annotated transaction or trace snapshots, as generally there is no ability to connect directly to the application runtime from a management console. Much like the older generation, this type of solution also introduces delay in waiting for the completion of a transaction or trace.
The monitoring function is partitioned with the diagnosis component distributed to the application runtime and the analysis centralized within the monitoring server.
The scope of the diagnosis is exclusively local, though this is often compensated for, in a piecemeal manner, by the analysis function processing data in an (exclusively) global context.
As with first generation solutions, the effectiveness of any sort of global analysis is hindered by an impedance mismatch between the live application runtime model (threads, stacks, code, resources) and the database table record design, resulting in convoluted data access, transformation and analysis processing code. Further compounding this is the usage of different (data) models and libraries by the diagnosis and analysis functions.
CEP Generation APM
A few application performance monitoring solutions have tried using a complex event processing (CEP) platform as the foundation for a monitoring server. The big difference here is that instead of storing records in a database and then later querying the data, events are streamed from the application runtime into the monitoring server and then pushed immediately, via some form of callback/notification, to rules (and routines) responsible for runtime diagnosis and data analysis.
Delay is still present, though far less than in first and “next” generation solutions, as most event processing batches writes to the event stream in the form of completed traces and transactions (complex events) processed locally from lower level events.
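The push model described above can be sketched as follows. This is a minimal illustration and not the API of any particular CEP platform: `EventStream`, `TraceEvent` and the 500 ms threshold are hypothetical names and values. Rules register a callback and are invoked as soon as an event is published, with no intermediate store-then-query step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of callback-driven event delivery; not a real CEP API.
final class PushStreamSketch {

    // A complex event: a completed trace assembled in the application runtime.
    static final class TraceEvent {
        final String name;
        final long elapsedMicros;

        TraceEvent(String name, long elapsedMicros) {
            this.name = name;
            this.elapsedMicros = elapsedMicros;
        }
    }

    // The server-side stream: rules subscribe once and are called per event.
    static final class EventStream {
        private final List<Consumer<TraceEvent>> rules = new ArrayList<>();

        void subscribe(Consumer<TraceEvent> rule) {
            rules.add(rule);
        }

        // Pushes the event to every rule immediately on arrival.
        void publish(TraceEvent event) {
            for (Consumer<TraceEvent> rule : rules) {
                rule.accept(event);
            }
        }
    }

    public static void main(String[] args) {
        EventStream stream = new EventStream();
        List<String> slowTraces = new ArrayList<>();

        // A diagnosis rule: flag traces slower than an assumed 500 ms threshold.
        stream.subscribe(e -> {
            if (e.elapsedMicros > 500_000L) {
                slowTraces.add(e.name);
            }
        });

        stream.publish(new TraceEvent("checkout", 750_000L));
        stream.publish(new TraceEvent("search", 20_000L));

        System.out.println(slowTraces); // prints [checkout]
    }
}
```

Note that the rule only ever sees the flattened `TraceEvent`, not the threads and stacks that produced it.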
All monitoring functionality is centralized except for some event processing within the application runtime in constructing complete transactions and traces.
The scope of the event analysis performed by the diagnosis and analysis sub-components can be either local or global, though, as with previous architectures, the richness of the application runtime model and the ease of its inspection are all but lost in the translation to events and their transmission in a stream.
Simz Generation APM
In setting out to design a solution to address many of the weaknesses in existing application performance monitoring solutions we had the following requirements and objectives:
- It should be possible to distribute and replicate both the diagnosis and analysis functions across application runtimes and multiple monitoring service instances. In addition, such replication should support local parameterization of operation.
- Both the diagnosis and analysis functions should use the same software execution behavior/resource model irrespective of the distribution (and location).
- The scope (global or local) of these functions should be, for the most part, transparent to existing monitoring code routines. Analysis of parallel software execution across both threads and processes should be handled by the same code routines. Such functions should place more emphasis on the thread execution irrespective of its location (process container).
- There should be no significant difference in monitoring an application from a monitoring console whether connected directly to an application runtime or indirectly via a monitoring service. Both the model and operation provided should be consistent across such different connectivity options.
- Where possible delay should be eliminated via replication of monitoring function, immediate partial event transmission, and continuous execution of both diagnosis and analysis routines.
In stepping back and looking at the tall order we had given ourselves, it became apparent that what we needed was for all applications to be deployed to a single very BIG JVM. This would provide the low latency and efficiencies (in terms of data sharing & transmission) of a profiling solution with the global scope of a typical application performance management solution and, more importantly, would require no code changes to the large number of existing extensions and plugins built on our activity based resource metering engine and its Open API.
“When you have eliminated the impossible, whatever remains, however improbable, must be the truth” – Sherlock Holmes
“When you have discounted all other possibilities, whatever remains, however impossible, must be the solution” – William Louth
Fortunately, after immediately discounting our BIG JVM scenario, we had an amazing and insightful breakthrough. What if we could bring the applications to a BIG JVM – or at least the execution essence of the application? Companies such as Azul Systems have shown that you can move bytecode and execution from a client proxy JVM to a JVM running within a remote compute appliance. Our job would be much easier, and more efficient and scalable, as we did not need to execute bytecode, only the activity metering calls into the Probes Open API. To do so we would make the monitoring server appear like a normal application runtime to both the measurement engines (metering & metrics) and monitoring consoles: a single very big application runtime that would simulate and replay, in near real time, all metered software execution behavior occurring in connected application runtimes. Threads would be created, started and stopped. Probes (methods) would be fired (executed) and metered (measured) by the same thread that existed within the application runtime and with the same reading(s) from the application meter(s). Scalability would not be hindered by state (heap) concerns as we already had the perfect lightweight model of what constitutes the essence of software execution behavior and resource consumption.
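The replay idea can be illustrated with a minimal sketch. It is not the Simz implementation and does not use the Probes Open API: `FeedEvent`, `Simulator` and the record layout are hypothetical, and the feed is assumed to deliver begin/end records tagged with the originating runtime, thread and meter reading. The simulator keeps one stack per remote thread and meters each probe using the readings taken in the application runtime rather than its own clock.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types belong to JXInsight/Simz or its Open API.
final class ReplaySketch {

    enum Kind { BEGIN, END }

    // One metering record as it might arrive from a connected runtime (assumed shape).
    static final class FeedEvent {
        final String runtime;   // application runtime that sent the record
        final long threadId;    // thread identifier within that runtime
        final Kind kind;
        final String probe;     // name of the metered activity, e.g. a method
        final long clockMicros; // meter reading taken in the application runtime

        FeedEvent(String runtime, long threadId, Kind kind, String probe, long clockMicros) {
            this.runtime = runtime;
            this.threadId = threadId;
            this.kind = kind;
            this.probe = probe;
            this.clockMicros = clockMicros;
        }
    }

    // A probe that has begun but not yet ended on a simulated thread.
    static final class Frame {
        final String probe;
        final long beginMicros;

        Frame(String probe, long beginMicros) {
            this.probe = probe;
            this.beginMicros = beginMicros;
        }
    }

    static final class Simulator {
        // One call stack per remote thread, keyed by runtime and thread id.
        private final Map<String, Deque<Frame>> threads = new HashMap<>();
        final List<String> completed = new ArrayList<>();

        void replay(FeedEvent e) {
            String thread = e.runtime + "/" + e.threadId;
            Deque<Frame> stack = threads.computeIfAbsent(thread, k -> new ArrayDeque<>());
            if (e.kind == Kind.BEGIN) {
                stack.push(new Frame(e.probe, e.clockMicros));
                return;
            }
            Frame frame = stack.poll(); // null if the matching begin was never seen
            if (frame != null) {
                completed.add(thread + " " + frame.probe + " "
                    + (e.clockMicros - frame.beginMicros));
            }
        }

        int simulatedThreads() {
            return threads.size();
        }
    }

    public static void main(String[] args) {
        Simulator simulator = new Simulator();
        // Interleaved records from two runtimes, replayed in arrival order.
        simulator.replay(new FeedEvent("app-1", 1, Kind.BEGIN, "Order.place", 100));
        simulator.replay(new FeedEvent("app-2", 1, Kind.BEGIN, "Stock.reserve", 120));
        simulator.replay(new FeedEvent("app-2", 1, Kind.END, "Stock.reserve", 170));
        simulator.replay(new FeedEvent("app-1", 1, Kind.END, "Order.place", 400));
        System.out.println(simulator.completed);
    }
}
```

Because begin records are replayed as they arrive, an analysis routine could inspect the in-flight stacks of every connected runtime without waiting for traces to complete.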
Of course, realizing such an ambitious vision and design has not been without its technical challenges, but we have overcome nearly all of them, in part due to our superior Open API design, the existing extensibility within our metering engine, and the development and testing of high performance code. In doing so we have increased the level of security and reliability by offering developers an alternative application universe in which to probe and observe, as well as allowing the monitoring service to be accessible to non-JVM languages and runtimes via a portable binary metering feed protocol that makes it possible to simulate and analyze across both language and process boundaries.
Effectively, we have engineered JXInsight/Simz as a near real-time, scalable, discrete (concurrent) event simulation solution in which events represent the stages (start, stop) in the execution of software behavior.