JXInsight/Simz – Near Real-Time Application Replay – 10 Million Calls / Second
In previous articles detailing applications of our rather novel live replay of metered software execution, it was stated that this would occur in near real-time. But what does “near real-time” really mean? Let’s start with a definition of “real-time” taken from Wikipedia.
“In the domain of simulations, the term means that the simulation’s clock runs as fast as a real clock would; and in the domain of data transfer, media processing and enterprise systems, the term is used to mean ‘without perceivable delay’.”
This leads us to the next question (putting aside, for the moment, the delay resolution): what exactly is being simulated? JXInsight/Simz does not play back the actual execution of application code, nor does it track the data that is passed, shared and changed across calls and threads. Both would be impractical in terms of scalability, and next to impossible to do without impacting runtime performance.
What JXInsight/Simz does is record the calls the instrumented code makes into the JXInsight/OpenCore Probes Open API, storing the thread context, the probe name metered (Java/Ruby/Python method, SQL statement, HTTP URL) and the meter readings (performed locally). This record stream is then asynchronously sent to the Simz service, where the thread is reconstructed, the Open API calls are replayed and the meter readings are synced with the stream data.
From the perspective of the JXInsight/OpenCore metering engine there is no discernible difference between the two runtimes. From a deployment perspective there is a huge difference: none of the application’s codebase, configuration, resources or heap state is deployed in the simulated runtime.
JXInsight/Simz captures and replays what we regard as the true essence of the software execution behavior – activities (code namespaces), flows (threads), resources (meter usage) and time (sequencing).
Getting back to delay resolution, a quick way to estimate it is to determine the maximum steady throughput of the client metering feed, assuming that no significant delay or queueing is in place at either end of the transmission. The following code does just that, with the RUN_COUNT parameter set to 1 billion.
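The benchmark code itself is not reproduced in the text; a minimal self-contained sketch of what it measures might look like the following. RUN_COUNT and call() are the names the article refers to; the Probes metering is woven into call() dynamically by the runtime agent, so it does not appear in the source.

```java
// Sketch of the throughput benchmark described above (assumed shape, not the original code).
public class Benchmark {

    private static final long RUN_COUNT = 1_000_000_000L;

    // No-op: the agent-injected Probes instrumentation is what actually gets measured.
    // (Without the agent attached, the JIT will largely eliminate this loop.)
    private static void call() { }

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (long i = 0; i < RUN_COUNT; i++) {
            call();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d calls in %d ms%n", RUN_COUNT, elapsedMs);
    }
}
```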
Note: With the call() method being a no-op, we are effectively measuring the metering instrumentation added dynamically by our runtime agent.
So as not to delay the client with the cost of a system call, I disabled the (default-enabled) clock.time meter and instead used a simple but extremely fast built-in meter called count, which increments on each meter read. I also disabled the strategy metering extension to ensure all fired probes were indeed metered.
To eliminate further measurement overhead I disabled both process- and group-level metering aggregation.
Finally, the simz metering extension was enabled, which by default attempts to connect to a locally running Simz service.
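Pulled together, the client settings described above would amount to a configuration along these lines. Every property key below is an illustrative placeholder, not an actual OpenCore configuration name; the real keys live in the product’s configuration reference.

```properties
# Hypothetical keys -- the real OpenCore property names may differ.
jxinsight.probes.meter.clock.time.enabled=false   # avoid the system-call cost per reading
jxinsight.probes.meter.count.enabled=true         # fast built-in meter: increments per read
jxinsight.probes.strategy.enabled=false           # meter every fired probe
jxinsight.probes.metering.process.enabled=false   # drop process-level aggregation
jxinsight.probes.metering.group.enabled=false     # drop group-level aggregation
jxinsight.probes.simz.enabled=true                # stream to a locally running Simz service
```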
For the Simz service, the following configuration was used, in keeping with the measurement profile of the client.
Running both processes on the same host, with a single executing thread, the feed throughput was measured at 20.8 million records a second. That is 10.4 million calls per second, as each call pushes a BEGIN type record (13 bytes) and an END type record (9 bytes) into the metering feed. These measurements were corroborated by network statistics reporting between 228 MB/s and 230 MB/s. At 10.4 million calls a second, the average latency of a call was ~96 nanoseconds. Though this is not an entirely accurate measurement of the period between the real event and its replay, it does hint at the resolution and scale that can be handled. At such rates the delay is most certainly in the millisecond range (if not microsecond) with some intelligent and efficient chunking and coalescing of records at either end.
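The reported figures are internally consistent, which a quick arithmetic check makes plain:

```java
// Sanity-checking the published numbers: 10.4M calls/s, 13-byte BEGIN + 9-byte END records.
public class FeedMath {
    public static void main(String[] args) {
        final double callsPerSecond = 10.4e6;     // measured call rate
        final int beginBytes = 13, endBytes = 9;  // record sizes from the feed

        double bytesPerSecond = callsPerSecond * (beginBytes + endBytes);
        double mbPerSecond = bytesPerSecond / 1e6;   // decimal megabytes
        double nsPerCall = 1e9 / callsPerSecond;     // average call latency

        // prints "throughput: 228.8 MB/s, latency: 96.2 ns/call" --
        // squarely inside the 228-230 MB/s band the network statistics reported,
        // and matching the ~96 ns average latency quoted above.
        System.out.printf("throughput: %.1f MB/s, latency: %.1f ns/call%n",
                mbPerSecond, nsPerCall);
    }
}
```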
10 million a second at just one metering point probably sounds very much like Big Data, but it is much more than that: we are not just streaming data from one process (or node) to another, we are actually replaying the execution behavior (activity ~= probe ~= method) along with its contextual data (resource ~= meter ~= consumption). Data implies a passive, flat plane. Behavior is active, richer and, more importantly, far easier to reason about from a plugin development and management reporting perspective.
Morpheus: Unfortunately, no one can be told what the Matrix is. You have to see it for yourself.
Morpheus: If real is what you can feel, smell, taste and see, then ‘real’ is simply electrical signals interpreted by your brain.
OS: Mountain Lion OS X 10.8
Processor: 2.6 GHz Intel Core i7
Memory: 8GB 1600 MHz DDR3
L2 Cache: 256 KB
L3 Cache: 6 MB
Java Runtime: Java HotSpot(TM) 1.7.0_06 (build 23.2-b09, mixed mode)